Effectively Testing System Configurations of Critical IoT Analytics Pipelines
Morgan K. Geldenhuys∗, Lauritz Thamsen∗, Kain Kordian Gontarska†, Felix Lorenz∗ and Odej Kao∗
∗ Technische Universität Berlin, Germany, {firstname.lastname}@tu-berlin.de
† Hasso Plattner Institute, University of Potsdam, Germany, [email protected]
DOI: 10.1109/BigData47090.2019.9005504
Abstract—The emergence of the Internet of Things has seen the introduction of numerous connected devices used for the monitoring and control of even Critical Infrastructures. Distributed stream processing has become key to analyzing the data generated by these connected devices and improving our ability to make decisions. However, optimizing these systems towards specific Quality of Service targets is a difficult and time-consuming task, due to the large-scale distributed systems involved, the large number of configuration parameters, and the inability to easily determine the impact of tuning these parameters. In this paper we present an approach for the effective testing of system configurations for critical IoT analytics pipelines. We demonstrate our approach with a prototype called Timon, which is integrated with Kubernetes. This tool allows pipelines to be easily replicated in parallel and evaluated to determine the optimal configuration for specific applications. We demonstrate the usefulness of our approach by investigating different configurations of an exemplary geographically-based traffic monitoring application implemented in Apache Flink.
Index Terms—Distributed Stream Processing, Internet of Things, Configuration Testing, Quality of Service.
I. INTRODUCTION
The Internet of Things (IoT) is an important emerging technological paradigm whereby billions of ubiquitous sensor and actuator devices are connected to enable the development of applications across a wide number of domains. An increasing number of these applications are expected to perform in a capacity where the services they provide must meet certain minimum Quality of Service (QoS) requirements. This is especially relevant for applications used in the real-time monitoring and control of Critical Infrastructures, such as: human health-care, transportation systems, electrical generation, natural disaster prediction, and telecommunications, to name but a few [1]–[4].

As the number of Internet-connected devices increases year-on-year, so does the volume of data being produced. In order to process these large data streams, Distributed Stream Processing Frameworks (DSPFs) such as Storm [5] and Flink [6] allow for the deployment of analytics pipelines which utilize the processing power of a cluster of commodity nodes. These frameworks are therefore being used increasingly for the processing of IoT data streams [7]–[10]. Applications developed within these systems are, in principle, required to operate indefinitely on an unbounded stream of continuous data in an environment where partial failures are to be expected as these applications scale. Consequently, DSPFs feature high availability modes, implement fault tolerance mechanisms by default, and expose a rich set of continually evolving features. The end result is that the way in which these systems are composed involves a high level of complexity and a large number of configuration options. A quick scan of the official documentation reveals that Flink has over 300 options across 28 categories (https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), and Spark [11] closer to 400 across 26 categories (https://spark.apache.org/docs/latest/configuration).

System configuration has an impact on performance and reliability. Yet the vast number of options available for tuning, i.e. framework settings, job parameters, resource selections, etc., have effects that are not always well understood or straightforward to determine. That is, the best combination of resource selections and system configurations is difficult to estimate upfront, both by experts and automatically by optimization tools, as it is highly dependent on a number of key factors: the analytics application, which exhibits its own unique operational characteristics; the cluster environment, which is often not known before deployment and may vary over time, i.e. network topologies and physical hardware; and the data, which varies based on the characteristics of the input data, loads from other applications, and ingestion rates. This is especially true in environments consisting of multiple connected distributed systems making up larger application architectures, such as: resource managers, messaging queues, distributed file systems, scalable databases, etc. At the same time, critical IoT applications typically have defined QoS requirements with regards to performance, reliability, etc., which a configuration should meet [12].

Currently, the most common way of tuning configuration parameters is for it to be done manually by performance engineers, usually requiring several hours of investigation and testing [13]. These engineers require detailed knowledge of the specific DSPF itself and the cluster environment in order to find a system configuration that falls in line with the aforementioned QoS constraints.
Approaches have been proposed for finding more precise and less time-consuming methods for the automatic tuning of DSPF parameters [14]–[18]. These typically focus on only a limited number of settings, while in practice there are numerous points of configuration with many dependencies between them. A solution is needed which is complementary to these existing performance modelling approaches and which provides a means for gathering analytics data through testing and monitoring.

For this purpose, we propose an approach for the effective testing of system configurations for critical IoT analytics pipelines in realistic conditions. We implemented our approach in a prototype called Timon which allows for the testing of multiple different versions of system configurations in parallel within an environment that behaves like production, using real streaming data. In this way, operators can safely and efficiently experiment with potential system configurations to understand what impact these will have when used in production.

The remainder of the paper is structured as follows: Section II discusses the related work with regards to configuring DSPFs, Section III presents a typical architecture for critical IoT analytics pipelines, Section IV presents our approach to configuration testing, Section V describes our evaluation where we present our experiments and findings, and Section VI discusses our findings with conclusions.
II. RELATED WORK
There exists a large body of work which addresses the problem of system configuration. A number of these approaches focus specifically on the tuning of parameters for DSPFs. This typically involves learning from actual executions or historical data, modeling specific aspects of the systems, and then adapting to actual conditions based on user requirements. We see our work as being orthogonal and complementary to these contributions in that Timon provides a testing and metric gathering environment within which these approaches could function. These approaches can be categorized as follows:
Rule-based: A gray-box heuristic approach where domain experts work with users to establish a rule-set which is used to recommend suitable configurations. Bilal et al. [16] present an approach where users provide a parameter ranking in accordance with a priority level and specify whether an increase in parameter value has an overall positive impact on latency and throughput. This approach favors quickly finding a suitable configuration at the expense of optimality.
Model-based: An approach which is concerned with conducting experiments on a chosen set of configurations to observe their performance. The results are used to train a statistical model for finding good configurations. Fischer et al. [14] and Trotter et al. [17] present auto-tuning algorithms using Bayesian Optimization (BO) [19] to achieve high throughput. Jamshidi et al. [15] likewise propose BO; however, they optimize latency and leverage Gaussian Processes [20] to continuously estimate the mean and confidence interval of a response variable at yet-to-be-explored configurations (a minimal sketch of this style of tuning loop is given at the end of this section).
Search-based: For this approach, an initial configuration is selected, after which experiments are conducted sequentially. Each iteration uses the results of the previous to fit a statistical model which is used to select the next configuration. Evolutionary search algorithms are typically adopted for automatic parameter tuning. Trotter et al. [17] propose a method using genetic algorithms (GA) to optimize throughput. Additionally, in a later paper Trotter et al. [18] use GA to optimize throughput, using SVM classifiers to further refine the search of the configuration space. Bilal et al. [16] propose a hill-climbing algorithm based on Latin Hypercube Sampling [21] while taking both latency and throughput metrics into account.
Learning-based: An approach which uses online learning techniques such as reinforcement learning to find the optimal configuration by reacting to feedback, i.e. metrics, from the DSPF at runtime [22]. This approach can be combined with offline learning techniques to speed up convergence [23].

To the best of our knowledge, no approach exists which focuses on parameter tuning for critical IoT analytics pipelines executing in the production environment. For these applications it is essential to consider the time-dependent nature of IoT data streams when optimizing the performance of DSPFs.
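To illustrate the model-based category above, the following is a minimal sketch of a BO tuning loop over a single configuration parameter using the scikit-optimize library. The objective function, parameter bounds, and the measure_latency benchmark stub are hypothetical stand-ins for illustration and are not taken from the cited works.

    from skopt import gp_minimize
    from skopt.space import Integer

    def measure_latency(params):
        """Hypothetical benchmark: deploy the job with the given checkpoint
        interval (ms), run it for a fixed period, and return the observed
        median end-to-end latency (ms). Stubbed here for illustration."""
        (checkpoint_interval_ms,) = params
        return abs(checkpoint_interval_ms - 60_000) / 1000.0  # dummy response surface

    # BO fits a Gaussian Process to past (configuration, latency) observations
    # and picks the next configuration to try by balancing exploration and
    # exploitation, converging on low-latency settings in few experiments.
    result = gp_minimize(
        func=measure_latency,
        dimensions=[Integer(1_000, 300_000, name="checkpoint_interval_ms")],
        n_calls=20,
        random_state=42,
    )
    print("best interval:", result.x[0], "ms; latency:", result.fun, "ms")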
III. IOT STREAM PROCESSING ARCHITECTURE
In this section we assume a typical architecture for the processing of IoT data streams, as depicted in Fig. 1. Here we see a number of systems which, when combined, provide a typical way of composing critical IoT analytics architectures. The distributed streaming platform is where raw IoT data flows into the system from sensor devices and is stored in a messaging queue to await processing. This component is an implementation of the publisher/subscriber messaging pattern, with Apache Kafka [24] being a distributed example of this. Subscribers such as the IoT analytics pipeline register themselves with the distributed streaming platform and consume messages from targeted messaging queues when they become available. Additionally, messages that have already been processed and produce alarms or notifications, for instance, can be written back to a separate messaging queue for further consumption. A minimal sketch of this pattern is given at the end of this section. The IoT analytics pipeline in turn consists of a number of inter-dependent systems. These systems include:
• Distributed Stream Processing Framework: responsible for executing the IoT analytics application with the current configuration set in order to process messages read from the distributed streaming platform. Once processed, outputs are written back to the distributed streaming platform if necessary and to the monitoring & analytics data store for archival. E.g. Apache Flink.
• Monitoring & Analytics Data Store: this scalable database warehouses all sanitized messages and outputs of the IoT analytics application. The archived data can be used for analysis over a longer period of time to detect trends and other anomalies. E.g. Apache Cassandra [25].
• Metrics: it is important to monitor the health of the IoT analytics application being executed. For this purpose, most DSPFs like Flink and Spark have a metric system that allows for the gathering and export of internal metrics to external systems. These measurements should be stored in a time series database to be accessed from outside the cluster. E.g. InfluxDB (https://github.com/influxdata/influxdb).
• Chaos Daemon: an optional component which, if deployed, could provide the facility for using Chaos Engineering techniques to promote the development of resilient services [26]. E.g. PowerfulSeal (https://github.com/bloomberg/powerfulseal).
Fig. 1. Typical IoT stream processing architecture.
In such a setup, the configuration set and/or IoT analytics application could be designed to a standardized specification, allowing them to be traded out for alternate versions without causing too much of a disruption to the overall environment.
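As a concrete illustration of the publisher/subscriber pattern described above, the following is a minimal sketch of a pipeline-side Kafka client using the kafka-python library. The broker address and the topic names (sensor-readings, alarms) are hypothetical placeholders, and the analytics step is a stub.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Subscribe to the raw sensor topic; the consumer group lets several
    # pipeline instances share partitions of the same messaging queue.
    consumer = KafkaConsumer(
        "sensor-readings",                       # hypothetical input topic
        bootstrap_servers="kafka:9092",          # hypothetical broker address
        group_id="iot-analytics-pipeline",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    # Processed results (e.g. alarms) are written back to a separate queue.
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for message in consumer:
        reading = message.value
        # Placeholder analytics step: flag vehicles reporting excessive speed.
        if reading.get("speed", 0) > 130:
            producer.send("alarms", {"vehicle_id": reading["vehicle_id"],
                                     "speed": reading["speed"]})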
IV. APPROACH

In distributed stream processing, system configuration has a direct impact on performance and reliability. Yet quantifying exactly how much of an impact is hard to ascertain. This is complicated by the existence of so many configuration parameters and the fact that no two stream processing applications with even a minor difference in operational characteristics are likely to share the same optimal setup. Additionally, if a more performant version of the configuration were to be found, migrating to this version in the production environment could cause significant disruptions. It is the goal of Timon to find solutions to these problems.

From a high-level perspective, the best system configuration for any particular stream processing application can be found by comparing it to the same application executing in the same environment while ingesting the same data, but using alternate variations of the configuration set. For such a test, variants should be executed in parallel, metrics recorded over a specific time interval, and, on conclusion, results compared to determine the best performer(s). This is the general idea behind Timon: to provide a testing system for the efficient comparison of alternative configurations in the production environment, i.e. testing with actually deployed systems, at scale, and with actual live data streams.

One of the key requirements of such a testing system is an environment where alternate deployments can be quickly replicated. Moreover, these deployments need to be isolated from each other in order to eliminate interference, while still being provided with access to external services. For this purpose we make use of two key enabling technologies, i.e. OS-level virtualization and container orchestration. These technologies, when combined with Infrastructure-as-Code (IaC) processes, provide a mechanism for efficiently instantiating entire pipelines in parallel, assuming enough resources are available in the cluster to do so.
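As a sketch of how such isolated replicas could be provisioned, the snippet below uses the official Kubernetes Python client to create one namespace per configuration variant. The variant names and label convention are hypothetical; the pipeline workloads themselves would then be deployed into these namespaces (see Section V-A).

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    # One isolated namespace per configuration variant under test; Kubernetes
    # namespaces scope resource names and network policies, which minimizes
    # interference between the parallel pipelines.
    for variant in ["checkpoint-short", "checkpoint-medium", "checkpoint-long"]:
        ns = client.V1Namespace(
            metadata=client.V1ObjectMeta(
                name=f"timon-{variant}",
                labels={"managed-by": "timon"},  # hypothetical label convention
            )
        )
        v1.create_namespace(body=ns)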
Fig. 2 provides an architectural overview of Timon and its dependencies. In this diagram we can see the virtual cluster environment where both the production pipeline and the shorter-lived configuration testing pipelines exist and are managed by the container orchestrator. The flow of data is from left to right, from source, i.e. the distributed streaming platform, to sink, i.e. the client gateway. Each configuration testing pipeline is composed in the same way as the production pipeline and executes the same IoT analytics application. Importantly, all configuration testing pipelines also process the same input data as the production pipeline. When notifications and alarms produced by a configuration testing pipeline need to be written back to the distributed streaming platform, each pipeline writes to its own unique messaging queues.
As part of the assumed IoT stream processing architecture described in the previous section, each IoT analytics pipeline records metrics in a time series database. After all testing rounds have concluded, these metrics are collected, aggregated, and subsequently analyzed to determine if statistically significant effects were observed. If a testing pipeline is found that performs better than the current production pipeline, then a strategy can be followed to replace it. This strategy first involves migrating all data not stored in the production pipeline to the new candidate pipeline, i.e. message queue and data store. Next, user traffic needs to be redirected towards the new data sources via the client gateway. In this way the client gateway can be thought of as a load balancer. Lastly, all redundant pipelines can then be safely decommissioned and their resources recovered. It is important to note that the copying of archived data over the network can be an expensive operation both in terms of time and network resources. It is therefore prudent to follow a strategy which minimizes this impact. Container orchestrators such as Kubernetes [27] offer a number of mechanisms for working with persisted data (see Persistent Volumes, https://kubernetes.io/docs/concepts/storage/persistent-volumes).

Apart from performance, reliability testing is also important for understanding the behavior of distributed systems. The number of things that can go wrong while a distributed system is running is enormous. This is mainly due to the distributed nature of all the components, which interact exclusively through message passing. It is virtually impossible to predict every possible failure mode and then engineer solutions for all the edge cases. Instead, a more realistic approach is to identify the weaknesses which cause these failures before they are triggered. This is where Chaos Engineering practices can be used to complement traditional testing approaches.
Fig. 2. Overview of Timon and system dependencies.
Therefore, Timon provides the ability to optionally define failure scenarios whereby failures are injected into the testing environment so that their impacts can be studied.
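The following is a minimal sketch of one such failure scenario, injected directly with the Kubernetes Python client by killing a random pod in a testing pipeline's namespace. This illustrates the general idea rather than PowerfulSeal's actual policy format, and the target namespace is a hypothetical placeholder.

    import random
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def kill_random_pod(namespace: str) -> str:
        """Failure scenario: delete one randomly chosen pod so that the
        pipeline's fault tolerance (e.g. checkpoint recovery) can be observed."""
        pods = v1.list_namespaced_pod(namespace).items
        victim = random.choice(pods)
        v1.delete_namespaced_pod(victim.metadata.name, namespace)
        return victim.metadata.name

    # Hypothetical target: one of the configuration testing pipelines.
    print("killed:", kill_random_pod("timon-checkpoint-short"))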
V. EVALUATION
We now demonstrate that using Timon is both practical and beneficial for distributed stream processing by presenting an experiment conducted to evaluate the impact of different checkpoint intervals on the overall performance of the system. Ensuring that DSPFs are fault tolerant while running in the production environment is imperative; however, quantifying the performance impact of this fault tolerance is important when considering QoS.
A. Prototype Implementation
Timon is essentially a software client which interfaces directly with a container orchestrator to automatically manage the instantiation and destruction of container groups. These container groups compose the inter-dependent systems which in turn make up the individual analytics pipelines. For the implementation, we use Docker (https://docker.com) and Kubernetes. These technologies were selected because they provide an IaC approach to the management and provisioning of pipelines through machine-readable definition files. Additionally, Kubernetes provides a mechanism for isolating pipelines through the use of namespaces, thereby minimizing the possibility of interference.
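To sketch what this IaC-driven provisioning could look like, the snippet below applies a set of machine-readable definition files into an isolated namespace using the Kubernetes Python client. The manifest file names and namespace are hypothetical placeholders for the pipeline components described in Section III.

    from kubernetes import client, config, utils

    config.load_kube_config()
    k8s = client.ApiClient()

    # Hypothetical definition files, one per pipeline component.
    manifests = ["flink-cluster.yaml", "zookeeper.yaml",
                 "cassandra.yaml", "influxdb.yaml"]

    # Applying the same manifests into different namespaces yields isolated
    # replicas of the pipeline, each of which can carry its own configuration set.
    for manifest in manifests:
        utils.create_from_yaml(k8s, manifest, namespace="timon-checkpoint-short")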
B. IoT Data Stream & Analytics Application
For the purposes of this experiment, we created a simulation which mapped the streets and intersections of an area within a one kilometer radius of central Berlin, Germany. In this area we generated a number of vehicles which travelled along various routes while providing an update message every second. This update contained the: vehicle ID, vehicle type, current location, speed, and direction. We use a sinusoidal function to model traffic behaviors, where the number of simultaneous vehicles varies from a minimum of 25,000 to a maximum of 75,000 as a function of the time of day (t in seconds). This was done to more closely resemble real traffic behaviors rather than a linear gradient, i.e. the number of vehicles gradually increases until a peak point (rush hour), before gradually decreasing again. Messages are submitted to an Apache Kafka cluster to await processing by the IoT analytics pipeline.

The live stream of traffic messages stored in Apache Kafka is consumed and analyzed using a DSPF. We use Apache Flink for our experiments as it has native support for fault-tolerant stream processing and is known for high performance and low latency [6]. A Flink cluster implements a master-slave architecture which consists of two processes: the JobManager and the TaskManager. We developed an analytics application for the purpose of analysis and define the following task: determine the total number of different vehicle types within the simulation area, accumulated over a 5 minute window period. Results were output to an Apache Cassandra database. This task uses "group by" transformations where the stream is logically partitioned into disjoint partitions. All messages with the same key are therefore assigned to the same partition, which allows for a high level of parallelism. The long windowing period of 5 minutes results in a larger accumulation of state across the parallel tasks and is therefore a good fit for testing fault tolerance behaviors.
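To make the workload concrete, the following sketch shows a sinusoidal traffic model together with a simplified, in-memory version of the 5-minute windowed count. The exact sinusoid used by the simulator is not specified above, so the function below is an assumed variant with the stated 25,000/75,000 bounds; the windowed count is a plain-Python stand-in for the semantics of the Flink job, not the implementation itself.

    import math
    from collections import Counter

    DAY = 24 * 60 * 60      # seconds per simulated day
    WINDOW = 5 * 60         # 5 minute tumbling window, in seconds

    def vehicles_at(t: float) -> int:
        """Assumed sinusoidal traffic model: the vehicle count oscillates
        between 25,000 and 75,000 over the day, with the minimum at t = 0."""
        return round(50_000 - 25_000 * math.cos(2 * math.pi * t / DAY))

    def windowed_type_counts(messages):
        """Simplified stand-in for the Flink job: group update messages by
        tumbling 5-minute window and count update messages per vehicle type."""
        windows = {}
        for msg in messages:  # msg: {"timestamp": ..., "vehicle_type": ...}
            window_start = int(msg["timestamp"] // WINDOW) * WINDOW
            windows.setdefault(window_start, Counter())[msg["vehicle_type"]] += 1
        return windows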
C. Experimental Setup

Our experimental setup consists of a 3 node Apache Kafka cluster and a 30 node Kubernetes cluster with HDFS [28]. Node specifications are shown in Table I.
TABLE I
CLUSTER SPECIFICATIONS

Resource  Details
OS        Ubuntu 18.04.3
CPU       Quadcore Intel Xeon CPU E3-1230 V2 3.30GHz
Memory    16 GB RAM
Storage   3 TB RAID0 (3x 1 TB disks, Linux software RAID)
Network   1 GBit Ethernet NIC
Software  Java v1.8, Apache Flink v1.9.0, Apache Kafka v2.3.0, Apache ZooKeeper v3.5.5, Docker v18.06, Kubernetes v1.15.3, Apache HDFS v2.8.3, Apache Cassandra v3.11.4, InfluxDB v1.6.4
We limit our configuration sets to vary a single variable, i.e. the checkpoint interval, and choose 3 different variations representing short, medium, and long intervals. Using Timon, we created 3 corresponding configuration testing pipelines in Kubernetes, each composed of:
• an Apache Flink (High Availability) cluster of 11 instances (1 JobManager and 10 TaskManagers);
• an Apache ZooKeeper cluster of 3 instances for distributed coordination;
• an Apache Cassandra cluster of 3 instances for archival of processed data; and
• a single InfluxDB time series database instance for collection of performance measurements.

We define four key indicators to measure the performance of each configuration set: end-to-end latency, which in DSPFs is the time difference between the moment a message is produced at the source task and the moment the corresponding tuple is produced at the output; input throughput, measured in messages/sec, which is the cumulative frequency at which messages enter the source tasks of the dataflow; and CPU utilization and heap memory utilization, each as a percentage.

The total time to provision the pipelines for each experimental testing round averaged 330 seconds. There were 5 rounds of testing conducted, where metrics were recorded over 6 hours with an increasing input throughput of 25,000 to 75,000 messages per second.
D. Experimental Results
In the experiments, during the user-defined time interval, metrics were recorded and saved to a time-series database. After all testing rounds were concluded, Timon automatically retrieved these metrics, aggregated them, and analysis was performed. Parallelism for the dataflow job was set to eight. This resulted in one sink operator executing on each of the eight active TaskManagers. Latency, therefore, is recorded at each sink operator separately. In order to address the observed variance in the performance metrics, the median latency value from all sink operators was chosen for each timestamp. The same was applied to the TaskManagers for CPU and memory utilization. Furthermore, the experiment was run for a total of five testing rounds over the same time interval, i.e. time of day. Again, the median values for each time step were chosen as the expected values. To further remove noise from the diagrams, exponentially weighted moving average windows with a span of 1000 seconds were applied to the averaged metrics.
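As a sketch of this aggregation step, assuming the metrics have been exported from the time-series database into a pandas DataFrame with one column per sink operator, one row per timestamp, and a 1 Hz sample rate (so a 1000-row span corresponds to 1000 seconds); this column layout is hypothetical:

    import pandas as pd

    def aggregate(per_sink: pd.DataFrame, span_rows: int = 1000) -> pd.Series:
        """Collapse per-sink latency samples into one denoised series:
        1) take the median across the eight sink operators per timestamp,
        2) smooth with an exponentially weighted moving average."""
        median_per_ts = per_sink.median(axis=1)          # robust to outlier sinks
        return median_per_ts.ewm(span=span_rows).mean()  # 1 row = 1 s assumed

    # The medians across the five testing rounds can be combined the same way,
    # e.g. with rounds as a hypothetical dict of {round_id: aggregated Series}:
    # expected = pd.concat(rounds, axis=1).median(axis=1)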
Fig. 3. End-to-end latencies.

Fig. 3 shows the average latencies as measured at the sink operators across all TaskManagers. Here we can clearly see how, as input throughput increases, performance deteriorates across all configurations. Additionally, latencies increase more drastically with rising input throughput the shorter the checkpoint interval is. A trade-off is visible between QoS requirements, i.e. the maximum amount of time before a message should be processed, and the recovery time should a failure occur, i.e. the time for the system to reach a state where all messages up until the failure are processed.

Figs. 4 and 5 show the average resource utilization for CPU and memory across all TaskManagers. Here we can see how CPU and memory usage increase as input throughput increases; however, in both cases utilization is low and there is no need to provision more resources through configuration. The default setting for the memory assigned to each TaskManager is 1 GB. Analysis of resource utilization for the JobManagers was likewise performed. As is to be expected, CPU and memory utilization was lower than for the TaskManagers, while following the same trend: the shorter the checkpoint interval, the more resources were consumed, and this consumption increases as input throughput increases.

Fig. 4. TaskManager CPU utilization.

Fig. 5. TaskManager memory utilization.
VI. CONCLUSION
This paper presented an approach which allows for the effective testing of system configurations of critical IoT analytics pipelines in realistic conditions. For this, we assume a typical distributed architecture for critical IoT analytics pipelines and utilize containerization as well as container orchestration in order to replicate instances of this architecture in parallel, each with its own configuration set. We showed how using such a testing approach in the production environment can capture the runtime behaviors of stream processing applications in order to investigate the individual performance of each configuration set. This was done by aggregating chosen metrics recorded over a defined number of testing rounds and then comparing them. Ultimately, the choice of which configuration set is the best performer should always consider pre-defined QoS requirements.

In the future, we would like to expand upon our approach in two ways. Firstly, we want to conduct experiments with failure scenarios and include critical IoT analytics applications from domains in addition to smart cities. Secondly, we want to research flexible methods for the automatic tuning of parameters and the selection of optimally performing configurations. Nevertheless, this approach has already proven to be a helpful testing method and a usable tool.
ACKNOWLEDGMENTS
This work has been supported through grants by the German Ministry for Education and Research (BMBF) as Berlin Big Data Center BBDC2 (funding mark 01IS18025A).
REFERENCES

[1] D. Georgakopoulos, P. Jayaraman, M. Fazia, M. Villari, and R. Ranjan, "Internet of things and edge cloud computing roadmap for manufacturing," IEEE Cloud Computing, vol. 3, pp. 66–73, 2016.
[2] J. Jin, J. Gubbi, S. Marusic, and M. Palaniswami, "An information framework for creating a smart city through internet of things," IEEE Internet of Things Journal, vol. 1, pp. 112–121, 2014.
[3] B. Cheng, S. Longo, F. Cirillo, M. Bauer, and E. Kovacs, "Building a big data platform for smart cities: Experience and lessons from santander," pp. 592–599, 2015.
[4] M. Lom, O. Pribyl, and M. Svitek, "Industry 4.0 as a part of smart cities," pp. 1–6, 2016.
[5] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. A. Bhagat, S. Mittal, and D. Ryaboy, "Storm@twitter," Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014.
[6] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, "Apache flink: Stream and batch processing in a single engine," IEEE Data Eng. Bull., vol. 38, pp. 28–38, 2015.
[7] G. Morales, A. Bifet, L. Khan, J. Gama, and W. Fan, "IoT big data stream mining," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[8] A. Shukla, S. Chaturvedi, and Y. Simmhan, "RIoTBench: An IoT benchmark for distributed stream processing systems," Concurrency and Computation: Practice and Experience, vol. 29, 2017.
[9] S. Amini, I. Gerostathopoulos, and C. Prehofer, "Big data analytics architecture for real-time traffic control," pp. 710–715, 2017.
[10] G. Jansen, I. Verbitskiy, T. Renner, and L. Thamsen, "Scheduling stream processing tasks on geo-distributed heterogeneous resources," pp. 5159–5164, 2018.
[11] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in HotCloud, 2010.
[12] G. White, V. Nallur, and S. Clarke, "Quality of service approaches in IoT: A systematic mapping," J. Syst. Softw., vol. 132, pp. 186–203, 2017.
[13] S. Allen, M. Jankowski, and P. Pathirana, "Storm applied: Strategies for real-time event processing," 2015.
[14] L. Fischer, S. Gao, and A. Bernstein, "Machines tuning machines: Configuring distributed stream processors with bayesian optimization," pp. 22–31, 2015.
[15] P. Jamshidi and G. Casale, "An uncertainty-aware approach to optimal configuration of stream processing systems," pp. 39–48, 2016.
[16] M. Bilal and M. Canini, "Towards automatic parameter tuning of stream processing systems," Proceedings of the 2017 Symposium on Cloud Computing, 2017.
[17] M. Trotter, G. Liu, and T. Wood, "Into the storm: Descrying optimal configurations using genetic algorithms and bayesian optimization," pp. 175–180, 2017.
[18] M. Trotter, T. Wood, and J. Hwang, "Forecasting a storm: Divining optimal configurations using genetic algorithms and supervised learning," pp. 136–146, 2019.
[19] B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. D. Freitas, "Taking the human out of the loop: A review of bayesian optimization," Proceedings of the IEEE, vol. 104, pp. 148–175, 2016.
[20] C. Rasmussen and H. Nickisch, "Gaussian processes for machine learning (GPML) toolbox," J. Mach. Learn. Res., vol. 11, pp. 3011–3015, 2010.
[21] M. McKay, R. Beckman, and W. Conover, "A comparison of three methods for selecting values of input variables in the analysis of output from a computer code," Technometrics, vol. 42, pp. 55–61, 2000.
[22] L. Vaquero and F. Cuadrado, "Auto-tuning distributed stream processing systems using reinforcement learning," ArXiv, vol. abs/1809.05495, 2018.
[23] X. Bu, J. Rao, and C. Xu, "A reinforcement learning approach to online web systems auto-configuration," pp. 2–11, 2009.
[24] J. Kreps, "Kafka: a distributed messaging system for log processing," 2011.
[25] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Oper. Syst. Rev., vol. 44, pp. 35–40, 2010.
[26] A. Basiri, N. Behnam, R. D. Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal, "Chaos engineering," IEEE Software, vol. 33, pp. 35–41, 2016.
[27] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at google with borg," Proceedings of the Tenth European Conference on Computer Systems, 2015.
[28] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.