Effectively Testing System Configurations of Critical IoT Analytics Pipelines
Morgan K. Geldenhuys∗, Lauritz Thamsen∗, Kain Kordian Gontarska†, Felix Lorenz∗ and Odej Kao∗
∗ Technische Universität Berlin, Germany, {firstname.lastname}@tu-berlin.de
† Hasso Plattner Institute, University of Potsdam, Germany, [email protected]
DOI: 10.1109/BigData47090.2019.9005504
Abstract—The emergence of the Internet of Things has seen the introduction of numerous connected devices used for the monitoring and control of even Critical Infrastructures. Distributed stream processing has become key to analyzing the data generated by these connected devices and improving our ability to make decisions. However, optimizing these systems towards specific Quality of Service targets is a difficult and time-consuming task, due to the large-scale distributed systems involved, the large number of configuration parameters, and the inability to easily determine the impact of tuning these parameters. In this paper we present an approach for the effective testing of system configurations for critical IoT analytics pipelines. We demonstrate our approach with a prototype called Timon, which is integrated with Kubernetes. This tool allows pipelines to be easily replicated in parallel and evaluated to determine the optimal configuration for specific applications. We demonstrate the usefulness of our approach by investigating different configurations of an exemplary geographically-based traffic monitoring application implemented in Apache Flink.
Index Terms—Distributed Stream Processing, Internet of Things, Configuration Testing, Quality of Service.
I. INTRODUCTION
The Internet of Things (IoT) is an important emerging technological paradigm whereby billions of ubiquitous sensor and actuator devices are connected to enable the development of applications across a wide number of domains. An increasing number of these applications are expected to perform in a capacity where the services they provide must meet certain minimum Quality of Service (QoS) requirements. This is especially relevant for applications used in the real-time monitoring and control of Critical Infrastructures, such as: human health-care, transportation systems, electrical generation, natural disaster prediction, and telecommunications, to name but a few [1]–[4].

As the number of Internet-connected devices increases year-on-year, so does the volume of data being produced. In order to process these large data streams, Distributed Stream Processing Frameworks (DSPFs) such as Storm [5] and Flink [6] allow for the deployment of analytics pipelines which utilize the processing power of a cluster of commodity nodes. These frameworks are therefore being used increasingly for the processing of IoT data streams [7]–[10]. Applications developed within these systems are, in principle, required to operate indefinitely on an unbounded stream of continuous data in an environment where partial failures are to be expected as these applications scale. Consequently, DSPFs feature high availability modes, implement fault tolerance mechanisms by default, and expose a rich set of continually evolving features. The end result is that the way in which these systems are composed involves a high level of complexity and a large number of configuration options. A quick scan of the official documentation reveals that Flink has over 300 options across 28 categories (https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), and Spark [11] closer to 400 across 26 categories (https://spark.apache.org/docs/latest/configuration).

System configuration has an impact on performance and reliability. Yet the vast number of options available for tuning, i.e. framework settings, job parameters, resource selections, etc., have effects that are not always well understood or straightforward to determine. That is, the best combination of resource selections and system configurations is difficult to estimate upfront, both by experts and automatically by optimization tools, as it is highly dependent on a number of key factors: the analytics application, which exhibits its own unique operational characteristics; the cluster environment, which is often not known before deployment and may vary over time, i.e. network topologies and physical hardware; and the data, which varies based on the characteristics of the input data, loads from other applications, and ingestion rates. This is especially true in environments consisting of multiple connected distributed systems making up larger application architectures, such as: resource managers, messaging queues, distributed file systems, scalable databases, etc. At the same time, critical IoT applications typically have defined QoS requirements with regards to performance, reliability, etc., which a configuration should meet [12].

Currently, the most common way of tuning configuration parameters is for it to be done manually by performance engineers, usually requiring several hours of investigation and testing [13]. These engineers require detailed knowledge of the specific DSPF itself and the cluster environment in order to find a system configuration that falls in line with the aforementioned QoS constraints.
Approaches have been proposed for finding more precise and less time-consuming methods for the automatic tuning of DSPF parameters [14]–[18]. These typically focus on only a limited number of settings, while in practice there are numerous points of configuration with many dependencies between them. A solution is needed which is complementary to these existing performance modelling approaches and which provides a means for gathering analytics data through testing and monitoring.

For this purpose, we propose an approach for the effective testing of system configurations for critical IoT analytics pipelines in realistic conditions. We implemented our approach in a prototype called Timon which allows for the testing of multiple different versions of system configurations in parallel within an environment that behaves like production, using real streaming data. In this way, operators can safely and efficiently experiment with potential system configurations to understand what impact these will have when used in production.

The remainder of the paper is structured as follows: Section II discusses the related work with regards to configuring DSPFs, Section III presents a typical architecture for critical IoT analytics pipelines, Section IV presents our approach to configuration testing, Section V describes our evaluation where we present our experiments and findings, and Section VI discusses our findings with conclusions.
II. RELATED WORK
There exists a large body of work which addresses the problem of system configuration. A number of these approaches focus specifically on the tuning of parameters for DSPFs. This typically involves learning from actual executions or historical data, modeling specific aspects of the systems, and then adapting to actual conditions based on user requirements. We see our work as being orthogonal and complementary to these contributions in that Timon provides a testing and metric gathering environment within which these approaches could function. These approaches can be categorized as follows:
Rule-based: A gray-box heuristic approach where domain experts work with users to establish a rule-set which is used to recommend suitable configurations. Bilal et al. [16] present an approach where users provide a parameter ranking in accordance with a priority level and specify whether an increase in parameter value has an overall positive impact on latency and throughput. This approach favors quickly finding a suitable configuration at the expense of optimality.
Model-based: An approach which is concerned with conducting experiments on a chosen set of configurations to observe their performance. The results are used to train a statistical model for finding good configurations. Fischer et al. [14] and Trotter et al. [17] present auto-tuning algorithms using Bayesian Optimization (BO) [19] to achieve high throughput. Jamshidi et al. [15] likewise propose BO; however, they optimize latency and leverage Gaussian Processes [20] to continuously estimate the mean and confidence interval of a response variable at yet-to-be-explored configurations (a minimal sketch of this style of tuning loop is given at the end of this section).
Search-based: For this approach, an initial configuration is selected, after which experiments are conducted sequentially. Each iteration uses the results of the previous to fit a statistical model which is used to select the next configuration. Evolutionary search algorithms are typically adopted for automatic parameter tuning. Trotter et al. [17] propose a method using genetic algorithms (GA) to optimize throughput. Additionally, in a later paper Trotter et al. [18] use GA to optimize throughput, using SVM classifiers to further refine the search of the configuration space. Bilal et al. [16] propose a hill-climbing algorithm based on Latin Hypercube Sampling [21] while taking both latency and throughput metrics into account.
Learning-based: An approach which uses online learning techniques such as reinforcement learning to find the optimal configuration by reacting to feedback, i.e. metrics, from the DSPF at runtime [22]. This approach can be combined with offline learning techniques to speed up convergence [23].

To the best of our knowledge, no approach exists which focuses on parameter tuning for critical IoT analytics pipelines executing in the production environment. For these applications it is essential to consider the time-dependent nature of IoT data streams when optimizing the performance of DSPFs.
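To illustrate the model-based category above, the following is a minimal sketch of a BO tuning loop over a single configuration parameter using the scikit-optimize library. The objective function, parameter bounds, and the measure_latency benchmark stub are hypothetical stand-ins for illustration and are not taken from the cited works.

    from skopt import gp_minimize
    from skopt.space import Integer

    def measure_latency(params):
        """Hypothetical benchmark: deploy the job with the given checkpoint
        interval (ms), run it for a fixed period, and return the observed
        median end-to-end latency (ms). Stubbed here for illustration."""
        (checkpoint_interval_ms,) = params
        return abs(checkpoint_interval_ms - 60_000) / 1000.0  # dummy response surface

    # BO fits a Gaussian Process to past (configuration, latency) observations
    # and picks the next configuration to try by balancing exploration and
    # exploitation, converging on low-latency settings in few experiments.
    result = gp_minimize(
        func=measure_latency,
        dimensions=[Integer(1_000, 300_000, name="checkpoint_interval_ms")],
        n_calls=20,
        random_state=42,
    )
    print("best interval:", result.x[0], "ms; latency:", result.fun, "ms")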
III. IOT STREAM PROCESSING ARCHITECTURE
In this section we assume a typical architecture for the processing of IoT data streams, as depicted in Fig. 1. Here we see a number of systems which, when combined, provide a typical way of composing critical IoT analytics architectures. The distributed streaming platform is where raw IoT data flows into the system from sensor devices and is stored in a messaging queue to await processing. This component is an implementation of the publisher/subscriber messaging pattern, with Apache Kafka [24] being a distributed example of this. Subscribers such as the IoT analytics pipeline register themselves with the distributed streaming platform and consume messages from targeted messaging queues when they become available. Additionally, messages that have already been processed and produce alarms or notifications, for instance, can be written back to a separate messaging queue for further consumption. A minimal sketch of this pattern is given at the end of this section. The IoT analytics pipeline in turn consists of a number of inter-dependent systems. These systems include:
• Distributed Stream Processing Framework: responsible for executing the IoT analytics application with the current configuration set in order to process messages read from the distributed streaming platform. Once processed, outputs are written back to the distributed streaming platform if necessary and to the monitoring & analytics data store for archival. E.g. Apache Flink.
• Monitoring & Analytics Data Store: this scalable database warehouses all sanitized messages and outputs of the IoT analytics application. The archived data can be used for analysis over a longer period of time to detect trends and other anomalies. E.g. Apache Cassandra [25].
• Metrics: it is important to monitor the health of the IoT analytics application being executed. For this purpose, most DSPFs like Flink and Spark have a metric system that allows for the gathering and export of internal metrics to external systems. These measurements should be stored in a time series database to be accessed from outside the cluster. E.g. InfluxDB (https://github.com/influxdata/influxdb).
• Chaos Daemon: an optional component which, if deployed, could provide the facility for using Chaos Engineering techniques to promote the development of resilient services [26]. E.g. PowerfulSeal (https://github.com/bloomberg/powerfulseal).
Fig. 1. Typical IoT stream processing architecture.
In such a setup, the configuration set and/or IoT analytics application could be designed to a standardized specification, allowing them to be traded out for alternate versions without causing too much of a disruption to the overall environment.
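As a concrete illustration of the publisher/subscriber pattern described above, the following is a minimal sketch of a pipeline-side Kafka client using the kafka-python library. The broker address and the topic names (sensor-readings, alarms) are hypothetical placeholders, and the analytics step is a stub.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Subscribe to the raw sensor topic; the consumer group lets several
    # pipeline instances share partitions of the same messaging queue.
    consumer = KafkaConsumer(
        "sensor-readings",                       # hypothetical input topic
        bootstrap_servers="kafka:9092",          # hypothetical broker address
        group_id="iot-analytics-pipeline",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    # Processed results (e.g. alarms) are written back to a separate queue.
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for message in consumer:
        reading = message.value
        # Placeholder analytics step: flag vehicles reporting excessive speed.
        if reading.get("speed", 0) > 130:
            producer.send("alarms", {"vehicle_id": reading["vehicle_id"],
                                     "speed": reading["speed"]})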
IV. APPROACH

In distributed stream processing, system configuration has a direct impact on performance and reliability. Yet quantifying exactly how much of an impact is hard to ascertain. This is complicated by the existence of so many configuration parameters and the fact that no two stream processing applications with even a minor difference in operational characteristics are likely to share the same optimal setup. Additionally, if a more performant version of the configuration were to be found, migrating to this version in the production environment could cause significant disruptions. It is the goal of Timon to find solutions to these problems.

From a high-level perspective, the best system configuration for any particular stream processing application can be found by comparing it to the same application executing in the same environment while ingesting the same data, but using alternate variations of the configuration set. For such a test, variants should be executed in parallel, metrics recorded over a specific time interval, and, on conclusion, results compared to determine the best performer(s). This is the general idea behind Timon: to provide a testing system for the efficient comparison of alternative configurations in the production environment, i.e. testing with actually deployed systems, at scale, and with actual live data streams.

One of the key requirements of such a testing system is an environment where alternate deployments can be quickly replicated. Moreover, these deployments need to be isolated from each other in order to eliminate interference, while still being provided with access to external services. For this purpose we make use of two key enabling technologies, i.e. OS-level virtualization and container orchestration. These technologies, when combined with Infrastructure-as-Code (IaC) processes, provide a mechanism for efficiently instantiating entire pipelines in parallel, assuming enough resources are available in the cluster to do so.
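As a sketch of how such isolated replicas could be provisioned, the snippet below uses the official Kubernetes Python client to create one namespace per configuration variant. The variant names and label convention are hypothetical; the pipeline workloads themselves would then be deployed into these namespaces (see Section V-A).

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    # One isolated namespace per configuration variant under test; Kubernetes
    # namespaces scope resource names and network policies, which minimizes
    # interference between the parallel pipelines.
    for variant in ["checkpoint-short", "checkpoint-medium", "checkpoint-long"]:
        ns = client.V1Namespace(
            metadata=client.V1ObjectMeta(
                name=f"timon-{variant}",
                labels={"managed-by": "timon"},  # hypothetical label convention
            )
        )
        v1.create_namespace(body=ns)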
Fig. 2 provides an architectural overview of Timon and its dependencies. In this diagram we can see the virtual cluster environment where both the production pipeline and the shorter-lived configuration testing pipelines exist and are managed by the container orchestrator. The flow of data is from left to right, from source, i.e. the distributed streaming platform, to sink, i.e. the client gateway. Each configuration testing pipeline is composed in the same way as the production pipeline and executes the same IoT analytics application. Importantly, all configuration testing pipelines also process the same input data as the production pipeline. When notifications and alarms produced by a configuration testing pipeline need to be written back to the distributed streaming platform, each pipeline writes to its own unique messaging queues.
As part of the assumed IoT stream processing architecture described in the previous section, each IoT analytics pipeline records metrics in a time series database. After all testing rounds have concluded, these metrics are collected, aggregated, and subsequently analyzed to determine if statistically significant effects were observed. If a testing pipeline is found that performs better than the current production pipeline, then a strategy can be followed to replace it. This strategy first involves migrating all data not stored in the production pipeline to the new candidate pipeline, i.e. message queue and data store. Next, user traffic needs to be redirected towards the new data sources via the client gateway. In this way the client gateway can be thought of as a load balancer. Lastly, all redundant pipelines can then be safely decommissioned and their resources recovered. It is important to note that the copying of archived data over the network can be an expensive operation both in terms of time and network resources. It is therefore prudent to follow a strategy which minimizes this impact. Container orchestrators such as Kubernetes [27] offer a number of mechanisms for working with persisted data (see Persistent Volumes, https://kubernetes.io/docs/concepts/storage/persistent-volumes).

Apart from performance, reliability testing is also important for understanding the behavior of distributed systems. The number of things that can go wrong while a distributed system is running is enormous. This is mainly due to the distributed nature of all the components, which interact exclusively through message passing. It is virtually impossible to predict every possible failure mode and then engineer solutions for all the edge cases. Instead, a more realistic approach is to identify the weaknesses which cause these failures before they are triggered. This is where Chaos Engineering practices can be used to complement traditional testing approaches.
Fig. 2. Overview of Timon and system dependencies.
Therefore, Timon provides the ability to optionally define failure scenarios whereby failures are injected into the testing environment so that their impacts can be studied.
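The following is a minimal sketch of one such failure scenario, injected directly with the Kubernetes Python client by killing a random pod in a testing pipeline's namespace. This illustrates the general idea rather than PowerfulSeal's actual policy format, and the target namespace is a hypothetical placeholder.

    import random
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def kill_random_pod(namespace: str) -> str:
        """Failure scenario: delete one randomly chosen pod so that the
        pipeline's fault tolerance (e.g. checkpoint recovery) can be observed."""
        pods = v1.list_namespaced_pod(namespace).items
        victim = random.choice(pods)
        v1.delete_namespaced_pod(victim.metadata.name, namespace)
        return victim.metadata.name

    # Hypothetical target: one of the configuration testing pipelines.
    print("killed:", kill_random_pod("timon-checkpoint-short"))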
V. EVALUATION
We now demonstrate that using Timon is both practical and beneficial for distributed stream processing by presenting an experiment conducted to evaluate the impact of different checkpoint intervals on the overall performance of the system. Ensuring that DSPFs are fault tolerant while running in the production environment is imperative; however, quantifying the performance impact of this fault tolerance is important when considering QoS.
A. Prototype Implementation
Timon is essentially a software client which interfaces directly with a container orchestrator to automatically manage the instantiation and destruction of container groups. These container groups compose the inter-dependent systems which in turn make up the individual analytics pipelines. For the implementation, we use Docker (https://docker.com) and Kubernetes. These technologies were selected because they provide an IaC approach to the management and provisioning of pipelines through machine-readable definition files. Additionally, Kubernetes provides a mechanism for isolating pipelines through the use of namespaces, thereby minimizing the possibility of interference.
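To sketch what this IaC-driven provisioning could look like, the snippet below applies a set of machine-readable definition files into an isolated namespace using the Kubernetes Python client. The manifest file names and namespace are hypothetical placeholders for the pipeline components described in Section III.

    from kubernetes import client, config, utils

    config.load_kube_config()
    k8s = client.ApiClient()

    # Hypothetical definition files, one per pipeline component.
    manifests = ["flink-cluster.yaml", "zookeeper.yaml",
                 "cassandra.yaml", "influxdb.yaml"]

    # Applying the same manifests into different namespaces yields isolated
    # replicas of the pipeline, each of which can carry its own configuration set.
    for manifest in manifests:
        utils.create_from_yaml(k8s, manifest, namespace="timon-checkpoint-short")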
B. IoT Data Stream & Analytics Application
For the purposes of this experiment, we created a simulation which mapped the streets and intersections of an area within a one kilometer radius of central Berlin, Germany. In this area we generated a number of vehicles which travelled along various routes while providing an update message every second. This update contained the: vehicle ID, vehicle type, current location, speed, and direction. We use a sinusoidal function to model traffic behaviors, where the number of simultaneous vehicles varies from a minimum of 25,000 to a maximum of 75,000 as a function of the time of day (t in seconds). This was done to more closely resemble real traffic behaviors rather than a linear gradient, i.e. the number of vehicles gradually increases until a peak point (rush hour), before gradually decreasing again. Messages are submitted to an Apache Kafka cluster to await processing by the IoT analytics pipeline.

The live stream of traffic messages stored in Apache Kafka is consumed and analyzed using a DSPF. We use Apache Flink for our experiments as it has native support for fault-tolerant stream processing and is known for high performance and low latency [6]. A Flink cluster implements a master-slave architecture which consists of two processes: the JobManager and the TaskManager. We developed an analytics application for the purpose of analysis and define the following task: determine the total number of different vehicle types within the simulation area, accumulated over a 5 minute window period. Results were output to an Apache Cassandra database. This task uses "group by" transformations where the stream is logically partitioned into disjoint partitions. All messages with the same key are therefore assigned to the same partition, which allows for a high level of parallelism. The long windowing period of 5 minutes results in a larger accumulation of state across the parallel tasks and is therefore a good fit for testing fault tolerance behaviors.
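To make the workload concrete, the following sketch shows a sinusoidal traffic model together with a simplified, in-memory version of the 5-minute windowed count. The exact sinusoid used by the simulator is not specified above, so the function below is an assumed variant with the stated 25,000/75,000 bounds; the windowed count is a plain-Python stand-in for the semantics of the Flink job, not the implementation itself.

    import math
    from collections import Counter

    DAY = 24 * 60 * 60      # seconds per simulated day
    WINDOW = 5 * 60         # 5 minute tumbling window, in seconds

    def vehicles_at(t: float) -> int:
        """Assumed sinusoidal traffic model: the vehicle count oscillates
        between 25,000 and 75,000 over the day, with the minimum at t = 0."""
        return round(50_000 - 25_000 * math.cos(2 * math.pi * t / DAY))

    def windowed_type_counts(messages):
        """Simplified stand-in for the Flink job: group update messages by
        tumbling 5-minute window and count update messages per vehicle type."""
        windows = {}
        for msg in messages:  # msg: {"timestamp": ..., "vehicle_type": ...}
            window_start = int(msg["timestamp"] // WINDOW) * WINDOW
            windows.setdefault(window_start, Counter())[msg["vehicle_type"]] += 1
        return windows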
C. Experimental Setup

Our experimental setup consists of a 3 node Apache Kafka cluster and a 30 node Kubernetes cluster with HDFS [28]. Node specifications are shown in Table I.
TABLE I
CLUSTER SPECIFICATIONS

Resource  Details
OS        Ubuntu 18.04.3
CPU       Quadcore Intel Xeon CPU E3-1230 V2 3.30GHz
Memory    16 GB RAM
Storage   3 TB RAID0 (3x 1 TB disks, Linux software RAID)
Network   1 GBit Ethernet NIC
Software  Java v1.8, Apache Flink v1.9.0, Apache Kafka v2.3.0, Apache ZooKeeper v3.5.5, Docker v18.06, Kubernetes v1.15.3, Apache HDFS v2.8.3, Apache Cassandra v3.11.4, InfluxDB v1.6.4
We limit our configuration sets to vary a single variable, i.e. the checkpoint interval, and choose 3 different variations representing short, medium, and long intervals. Using Timon, we created 3 corresponding configuration testing pipelines in Kubernetes, each composed of:
• an Apache Flink (High Availability) cluster of 11 instances (1 JobManager and 10 TaskManagers);
• an Apache ZooKeeper cluster of 3 instances for distributed coordination;
• an Apache Cassandra cluster of 3 instances for archival of processed data; and
• a single InfluxDB time series database instance for collection of performance measurements.

We define four key indicators to measure the performance of each configuration set: end-to-end latency, which in DSPFs is the time difference between the moment a message is produced at the source task and the moment the corresponding tuple is produced at the output; input throughput, measured in messages/sec, which is the cumulative frequency at which messages enter the source tasks of the dataflow; and CPU utilization and heap memory utilization, each as a percentage.

The total time to provision the pipelines for each experimental testing round averaged 330 seconds. There were 5 rounds of testing conducted, where metrics were recorded over 6 hours with an increasing input throughput of 25,000 to 75,000 messages per second.
D. Experimental Results
In the experiments, during the user-defined time interval, metrics were recorded and saved to a time-series database. After all testing rounds were concluded, Timon automatically retrieved these metrics, aggregated them, and analysis was performed. Parallelism for the dataflow job was set to eight. This resulted in one sink operator executing on each of the eight active TaskManagers. Latency, therefore, is recorded at each sink operator separately. In order to address the observed variance in the performance metrics, the median latency value from all sink operators was chosen for each timestamp. The same was applied to the TaskManagers for CPU and memory utilization. Furthermore, the experiment was run for a total of five testing rounds over the same time interval, i.e. time of day. Again, the median values for each time step were chosen as the expected values. To further remove noise from the diagrams, exponentially weighted moving average windows with a span of 1000 seconds were applied to the averaged metrics.
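As a sketch of this aggregation step, assuming the metrics have been exported from the time-series database into a pandas DataFrame with one column per sink operator, one row per timestamp, and a 1 Hz sample rate (so a 1000-row span corresponds to 1000 seconds); this column layout is hypothetical:

    import pandas as pd

    def aggregate(per_sink: pd.DataFrame, span_rows: int = 1000) -> pd.Series:
        """Collapse per-sink latency samples into one denoised series:
        1) take the median across the eight sink operators per timestamp,
        2) smooth with an exponentially weighted moving average."""
        median_per_ts = per_sink.median(axis=1)          # robust to outlier sinks
        return median_per_ts.ewm(span=span_rows).mean()  # 1 row = 1 s assumed

    # The medians across the five testing rounds can be combined the same way,
    # e.g. with rounds as a hypothetical dict of {round_id: aggregated Series}:
    # expected = pd.concat(rounds, axis=1).median(axis=1)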
Fig. 3. End-to-end latencies.

Fig. 3 shows the average latencies as measured at the sink operators across all TaskManagers. Here we can clearly see how, as input throughput increases, performance deteriorates across all configurations. Additionally, latencies increase more drastically with rising input throughput the shorter the checkpoint interval is. A trade-off is visible between QoS requirements, i.e. the maximum amount of time before a message should be processed, and the recovery time should a failure occur, i.e. the time for the system to reach a state where all messages up until the failure are processed.

Figs. 4 and 5 show the average resource utilization for CPU and memory across all TaskManagers. Here we can see how CPU and memory usage increase as input throughput increases; however, in both cases utilization is low and there is no need to provision more resources through configuration. The default setting for the memory assigned to each TaskManager is 1 GB. Analysis of resource utilization for the JobManagers was likewise performed. As is to be expected, CPU and memory utilization was lower than for the TaskManagers, while following the same trend: the shorter the checkpoint interval, the more resources were consumed, and this consumption increases as input throughput increases.

Fig. 4. TaskManager CPU utilization.

Fig. 5. TaskManager memory utilization.
VI. CONCLUSION
This paper presented an approach which allows for the effective testing of system configurations of critical IoT analytics pipelines in realistic conditions. For this, we assume a typical distributed architecture for critical IoT analytics pipelines and utilize containerization as well as container orchestration in order to replicate instances of this architecture in parallel, each with its own configuration set. We showed how using such a testing approach in the production environment can capture the runtime behaviors of stream processing applications in order to investigate the individual performance of each configuration set. This was done by aggregating chosen metrics recorded over a defined number of testing rounds and then comparing them. Ultimately, the choice of which configuration set is the best performer should always consider pre-defined QoS requirements.

In the future, we would like to expand upon our approach in two ways. Firstly, we want to conduct experiments with failure scenarios and include critical IoT analytics applications from domains in addition to smart cities. Secondly, we want to research flexible methods for the automatic tuning of parameters and the selection of optimally performing configurations. Nevertheless, this approach has already proven to be a helpful testing method and a usable tool.
ACKNOWLEDGMENTS
This work has been supported through grants by the German Ministry for Education and Research (BMBF) as Berlin Big Data Center BBDC2 (funding mark 01IS18025A).
REFERENCES

[1] D. Georgakopoulos, P. Jayaraman, M. Fazia, M. Villari, and R. Ranjan, "Internet of things and edge cloud computing roadmap for manufacturing," IEEE Cloud Computing, vol. 3, pp. 66–73, 2016.
[2] J. Jin, J. Gubbi, S. Marusic, and M. Palaniswami, "An information framework for creating a smart city through internet of things," IEEE Internet of Things Journal, vol. 1, pp. 112–121, 2014.
[3] B. Cheng, S. Longo, F. Cirillo, M. Bauer, and E. Kovacs, "Building a big data platform for smart cities: Experience and lessons from santander," pp. 592–599, 2015.
[4] M. Lom, O. Pribyl, and M. Svitek, "Industry 4.0 as a part of smart cities," pp. 1–6, 2016.
[5] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. A. Bhagat, S. Mittal, and D. Ryaboy, "Storm@twitter," Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014.
[6] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, "Apache flink: Stream and batch processing in a single engine," IEEE Data Eng. Bull., vol. 38, pp. 28–38, 2015.
[7] G. Morales, A. Bifet, L. Khan, J. Gama, and W. Fan, "IoT big data stream mining," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[8] A. Shukla, S. Chaturvedi, and Y. Simmhan, "RIoTBench: An IoT benchmark for distributed stream processing systems," Concurrency and Computation: Practice and Experience, vol. 29, 2017.
[9] S. Amini, I. Gerostathopoulos, and C. Prehofer, "Big data analytics architecture for real-time traffic control," pp. 710–715, 2017.
[10] G. Jansen, I. Verbitskiy, T. Renner, and L. Thamsen, "Scheduling stream processing tasks on geo-distributed heterogeneous resources," pp. 5159–5164, 2018.
[11] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in HotCloud, 2010.
[12] G. White, V. Nallur, and S. Clarke, "Quality of service approaches in IoT: A systematic mapping," J. Syst. Softw., vol. 132, pp. 186–203, 2017.
[13] S. Allen, M. Jankowski, and P. Pathirana, "Storm applied: Strategies for real-time event processing," 2015.
[14] L. Fischer, S. Gao, and A. Bernstein, "Machines tuning machines: Configuring distributed stream processors with bayesian optimization," pp. 22–31, 2015.
[15] P. Jamshidi and G. Casale, "An uncertainty-aware approach to optimal configuration of stream processing systems," pp. 39–48, 2016.
[16] M. Bilal and M. Canini, "Towards automatic parameter tuning of stream processing systems," Proceedings of the 2017 Symposium on Cloud Computing, 2017.
[17] M. Trotter, G. Liu, and T. Wood, "Into the storm: Descrying optimal configurations using genetic algorithms and bayesian optimization," pp. 175–180, 2017.
[18] M. Trotter, T. Wood, and J. Hwang, "Forecasting a storm: Divining optimal configurations using genetic algorithms and supervised learning," pp. 136–146, 2019.
[19] B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. D. Freitas, "Taking the human out of the loop: A review of bayesian optimization," Proceedings of the IEEE, vol. 104, pp. 148–175, 2016.
[20] C. Rasmussen and H. Nickisch, "Gaussian processes for machine learning (GPML) toolbox," J. Mach. Learn. Res., vol. 11, pp. 3011–3015, 2010.
[21] M. McKay, R. Beckman, and W. Conover, "A comparison of three methods for selecting values of input variables in the analysis of output from a computer code," Technometrics, vol. 42, pp. 55–61, 2000.
[22] L. Vaquero and F. Cuadrado, "Auto-tuning distributed stream processing systems using reinforcement learning," ArXiv, vol. abs/1809.05495, 2018.
[23] X. Bu, J. Rao, and C. Xu, "A reinforcement learning approach to online web systems auto-configuration," pp. 2–11, 2009.
[24] J. Kreps, "Kafka: a distributed messaging system for log processing," 2011.
[25] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Oper. Syst. Rev., vol. 44, pp. 35–40, 2010.
[26] A. Basiri, N. Behnam, R. D. Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal, "Chaos engineering," IEEE Software, vol. 33, pp. 35–41, 2016.
[27] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at google with borg," Proceedings of the Tenth European Conference on Computer Systems, 2015.
[28] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.