Towards AIOps in Edge Computing Environments
Soeren Becker, Florian Schmidt, Anton Gulenko, Alexander Acker, Odej Kao
Complex and Distributed IT-Systems, TU Berlin
Berlin, Germany
{firstname}.{lastname}@tu-berlin.de

Abstract—Edge computing was introduced as a technical enabler for the demanding requirements of new network technologies like 5G. It aims to overcome challenges related to centralized cloud computing environments by distributing computational resources to the edge of the network towards the customers. The complexity of the emerging infrastructures increases significantly, together with the ramifications of outages on critical use cases such as self-driving cars or health care. Artificial Intelligence for IT Operations (AIOps) aims to support human operators in managing complex infrastructures by using machine learning methods. This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments. The overhead of a high-frequency monitoring solution on edge devices is evaluated, and performance experiments regarding the applicability of three anomaly detection algorithms on edge devices are conducted. The results show that it is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices with a reasonable overhead on the resource utilization.
Index Terms—edge, monitoring, anomaly detection, AIOps
I. INTRODUCTION
Artificial Intelligence for IT Operations (AIOps) describes the process of maintaining and operating large IT infrastructures using AI-supported methods and tools on different levels, e.g. automated anomaly detection and root cause analysis, remediation and optimization, as well as fully automated initiation of self-stabilizing activities. The automation is mandatory to handle the increasing system complexity caused by the virtualization and software-defined-everything trends, combined with the significant increase in the number of servers, devices, and sensors. Service providers are aware of the need for always-on, dependable services and have already introduced additional intelligence to the IT ecosystem, e.g. by employing network and site reliability engineers, by deploying automated tools for 24/7 monitoring, and by capacity planning.

Rapidly decreasing the reaction time of system administrators is still highly encouraged and necessary due to performance problems (tuning), component or system failures (outages, degraded performance), or security incidents. Quick latency-bounded responses are demanded in multiple scenarios ranging from smart cities via automated manufacturing to autonomous driving [1]. In many situations, a response within pre-defined latency limits is not a soft requirement but can have a serious impact on the system functionality in our everyday life. A lot of effort was spent in the last decade to move the actors and sensors closer to the source of data and to reduce or even eliminate the uncertainty of the network performance, leading to the rise of edge and fog computing.

The management of edge and fog computing environments exposes, next to the obvious increase of complexity through the sheer quantity of devices, a number of additional drawbacks. The devices are typically physically located outside of data centers. Consequently, the devices are vulnerable to damaging actions, both intentional (theft, damage, jamming, etc.) and natural (weather influence, aging, interferences, etc.). Moreover, the spatial distribution of such devices prevents the application of standard data center procedures related to redundancy, access control, or maintenance. These drawbacks create a paradoxical situation: a vulnerable infrastructure has a decisive impact on our everyday life, as it delivers crucial data for many autonomous and smart-X processes. Thus, it is important to enable the continuous analysis of all relevant measurements to ensure correct minimal functionality of edge and fog devices.

In previous work we developed a self-healing pipeline which was extensively tested in cloud computing environments [2]. It consists of autonomously detecting anomalies, finding the root cause, and planning and executing appropriate actions to resolve or mitigate the occurring problem.
However, the requirements of edge and fog computing demand modifications to ensure the feasibility of the analysis pipeline in highly distributed environments. We aim to contribute to this goal by addressing the following:
• Support for heterogeneous CPU architectures to enable in-place analysis of the self-healing pipeline on edge devices
• Showing the feasibility of applying monitoring and AI models on edge devices with limited compute resources
• Environment-aware placement of analysis steps in the infrastructure to prevent overloading of lightweight edge devices and to comply with the hardware requirements of AI models
This paper describes the AIOps platform ZerOps4E, which combines monitoring and AI-based methods to learn the normal behavior of the edge devices and to detect deviations from this behavior indicating anomalies. We assume that edge devices are capable of running at least a Docker container (for example, Raspberry Pis). Thus, the ZerOps4E platform can be executed directly on the edge. This allows a local execution of all monitoring and at least the anomaly detection services, which reduces the requirements on network bandwidth and provides a foundation for low-latency processing. With a series of experiments, the performance and resource overhead of the monitoring agent and different anomaly detection algorithms on edge devices are tested to evaluate the feasibility of running those framework components directly at the edge.

The remainder of the paper is organized as follows: Section II outlines the related work, whereas Section III describes the background of the self-healing analysis applied in ZerOps4E. Section IV proposes the overall architecture of ZerOps4E and introduces the main framework components for data collection and self-healing analysis. In Section V the performance experiments are illustrated and evaluated, while Section VI concludes the main findings.
II. RELATED WORK
Anomaly detection, predictive maintenance and AIOps in general are important topics, especially for highly distributed environments such as edge computing infrastructures [3]. Bose et al. [4], for instance, apply anomaly detection methods in the predictive maintenance context: over the lifecycle of edge devices they start with low-accuracy models and only switch to higher-accuracy models when anomalies are detected, in order to decrease the energy consumption.

Another approach [5] leverages federated learning to train models directly on edge devices and aggregate the resulting weights at a central component. The authors use deep learning neural networks to identify anomalies and further exchange model data between edge nodes and the central component when an internet connection is available. They show that federated models can produce similar results to non-distributed models for several data sets.

Soualhia et al. [6] propose a framework similar to ours. In contrast, they mainly apply supervised machine learning techniques to detect and predict faults, whereas we utilize unsupervised methods. Components for pre-processing, fault detection and fault prediction are provided and evaluated on a simulated edge computing environment while injecting synthetic faults. Their approach shows promising results for non-fatal CPU and HDD overload anomalies. The authors further plan to test their components and models in resource-constrained environments.

Furthermore, anomaly detection is also often applied to detect attacks or malicious behavior, for example in [7]: Kozik et al. propose a platform to classify anomalous network traffic on edge devices by using extreme learning machine classifiers trained in the cloud. Another approach is introduced in [8], where the authors deploy anomaly detection algorithms in the fog layer instead of directly on the edge devices: the proposed
Fog-Empowered mechanism is able to improve the processing delay and energy consumption compared to other centralized and distributed methods by using hyperellipsoidal clustering.

Not only the scientific community has recognized the importance of AIOps; several industrial players are also working on solutions. The vision of the Fixstream AIOps+ platform (https://fixstream.com) is to predict any issues that can impact business applications and to automatically remediate these issues before they result in business outages. They combine business transactions, application and infrastructure issues in a single root cause analysis and thus aim to optimize IT resources to reduce the infrastructure costs. Moogsoft offers a closed-source AIOps platform which uses different machine learning algorithms to predict anomalies on a given data stream. The main purpose is recognizing root causes of failures within large infrastructures. Based on supervised training, Moogsoft trains a machine learning model predicting the root cause for similar failures in the future. The user can then choose a remediation that is afterwards stored in a database to be recommended for future problems.

III. BACKGROUND SELF-HEALING ANALYSIS
For data collection, we utilize the open-source project bitflow-collector (https://github.com/bitflow-stream/go-bitflow-collector). The monitoring service is able to collect resource metrics from several different hardware interfaces. The bitflow-collector monitors metrics for VMs, i.e. through libvirt (https://libvirt.org/), and for physical servers by parsing the proc file system. The monitoring service can be configured to collect at different monitoring rates in arbitrarily high frequencies while promising to keep the resource overhead low.

Gulenko et al. [2] described an analysis pipeline and architecture to provide self-healing to cloud systems. The system is built upon the open-source stream processing framework Bitflow (https://github.com/bitflow-stream/) in order to enable the sending of monitoring data as well as analysis results between machine learning algorithms. The analysis pipeline consists of the following steps:

Anomaly detection: For each monitored component an individual machine learning model is deployed. The model applies an unsupervised online algorithm, which is capable of autonomously detecting abnormal behavior of resource patterns. In case of an abnormality, the root cause analysis is alerted and the state of all components is forwarded.
Root cause analysis: The root cause analysis investigates the actual source of the underlying problem, i.e. the component causing the anomaly. It requires knowledge about the infrastructure's horizontal and vertical dependencies.
Remediation engine (decision engine): The engine aggregates the analysis results and combines the decisions to schedule and manage the execution of appropriate actions. This service requires the existence of a catalogue of remediation actions, which are expected to be provided by experts. Furthermore, a mapping is required to match the anomalies and the component of cause to the action.

This architecture provides a flexible structure to apply several different machine learning approaches to the analysis pipeline, making it flexible to investigate different approaches in this field. However, the proposed architecture lacks the flexibility of placing analysis steps on physically distributed compute devices and only applies streaming connections between analysis steps, causing highly intensive network traffic. In order to overcome these problems, we propose an adapted platform which is specialized to the needs and requirements of highly distributed and heterogeneous infrastructures. Additionally, we describe potential candidates of machine learning techniques to be placed within the infrastructure.
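To illustrate the kind of mapping implied by the catalogue of remediation actions mentioned above, the following is a simplified sketch with hypothetical anomaly, component and action names; it is not the catalogue format used in ZerOps4E.

```python
# Hedged sketch: expert-provided catalogue mapping (anomaly type, component type)
# to a remediation action. All names are illustrative only.
REMEDIATION_CATALOGUE = {
    ("memory_leak", "vm"): "migrate_vm",
    ("high_cpu", "service"): "scale_out_service",
    ("network_congestion", "edge_node"): "reroute_traffic",
}

def select_action(anomaly_type, component_type):
    """Look up a remediation action; fall back to alerting a human operator."""
    return REMEDIATION_CATALOGUE.get((anomaly_type, component_type), "notify_operator")

print(select_action("memory_leak", "vm"))  # -> migrate_vm
```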
IV. ZEROPS

A. Data Collection

We are utilizing the bitflow-collector as it promises low resource consumption for monitoring, which is needed for low-powered edge devices. The data is converted into a binary transport format and provided as a data stream which can be consumed by the data analysis steps. In addition, the collector was extended with a plugin to automatically create a data source object for the monitored node in the Kubernetes cluster.
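The data source objects created by this plugin are ordinary Kubernetes custom resources. As a rough illustration of the idea (not the actual plugin code; the API group, version, plural and field names below are hypothetical), registering a monitored node could look as follows:

```python
# Hedged sketch: registering a monitored node as a "data source" custom resource.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in the cluster
api = client.CustomObjectsApi()

data_source = {
    "apiVersion": "example.bitflow/v1",          # hypothetical API group/version
    "kind": "DataSource",
    "metadata": {"name": "edge-node-1"},
    "spec": {
        "url": "tcp://edge-node-1:5010",          # where the collector streams samples
        "labels": {"region": "edge", "arch": "arm64v8"},
    },
}

api.create_namespaced_custom_object(
    group="example.bitflow", version="v1",
    namespace="default", plural="datasources", body=data_source,
)
```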
B. Data Analysis
The self-healing analysis pipeline consists of several algorithms chained together in order to detect problems on a component level, while aggregating the results into a higher-level decision engine, which plans and executes appropriate actions. The data analysis pipeline is split into three parts:
• Decentralized analysis: Processing and filtering of data is applied in the first place near or directly on the compute device of the data source. An edge case is the applicability of anomaly detection as a decentralized component, as it might consume too many resources on the low-powered edge devices. Conceptually, the anomaly detection should be placed next to the data source whenever the compute capabilities are available. Otherwise, the anomaly detection has to be moved to nearby components or to the cloud. This edge case is increasingly interesting, as placing the detection directly on the edge device saves network traffic. Therefore, we investigate the feasibility of this key technique in the evaluation part.
• Centralized analysis: This contrasting concept consists of analysis pipelines which need to be executed centrally, for instance in a traditional cloud service. The decision engine, which orchestrates the execution of remediation actions on a global level, is such a kind of analysis step, as it requires knowledge of the whole infrastructure. Consequently, such algorithms require the existence of global information and, due to their centralized deployment, intensive computational resources can be assured. In contrast, the latency might increase due to the physical distances between monitored edge devices and the cloud servers [11].
• Partly-centralized analysis: Besides the extreme cases of decentralized and centralized placement of analysis steps, further cases exist in between. For example, root cause analysis can be applied with respect to a network slice or server rack instead of a global level.

In order to ensure the transportation of data between the placed algorithms, we leverage the Bitflow framework for high-throughput data streams, but for higher-level computations we utilize message queues. Stream processing can provide near real-time insights [11], whereas aggregated events are exchanged using the event bus. ZerOps4E implements RabbitMQ as event bus to transfer events, i.e. results of the decision engine, rather than streams.

The implementation of the anomaly detection algorithms follows the principle of the Identity Function and Threshold Model [12] to automatically adjust the anomaly detection model to the evolving data stream in an unsupervised manner. We integrated long short-term memory (LSTM) [12], BIRCH [13] and the autoregressive integrated moving average (ARIMA) [14] as reconstruction functions, while applying an exponential moving average as the dynamic threshold model. As anomalies might propagate between monitored components, we apply a time-based root cause analysis: components where anomalies can be detected earlier are stated as root cause components. Of course, this technique is limited in complex systems; we therefore recommend the survey by Solé et al. [15] for choosing advanced approaches.

Lastly, the decision engine gathers events of root cause components with additional information about the anomalies and applies density grid pattern matching [16] to recommend appropriate remediation actions. These actions are assumed to be provided beforehand by experts, but can be extended over time through reinforcement learning. Actions are, for example, the migration of a virtual machine or the reconfiguration of a particular service.
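To make the detection principle concrete, the following is a minimal sketch of the identity-function-and-threshold idea from [12]: a reconstruction function (here a trivial moving-average predictor standing in for LSTM, BIRCH or ARIMA) produces a reconstruction error per sample, and an exponential moving average over past errors serves as the dynamic threshold. It illustrates the concept only and is not the Java implementation used in ZerOps4E.

```python
# Minimal IFTM-style sketch: reconstruction error vs. EMA-based dynamic threshold.
from collections import deque

class IFTMDetector:
    def __init__(self, alpha=0.1, sensitivity=3.0, window=10):
        self.alpha = alpha              # EMA smoothing factor for the threshold
        self.sensitivity = sensitivity  # how far above the EMA counts as anomalous
        self.history = deque(maxlen=window)
        self.ema_error = None

    def _reconstruct(self, sample):
        # Placeholder identity function: predict the mean of recent samples.
        if not self.history:
            return sample
        return [sum(col) / len(self.history) for col in zip(*self.history)]

    def process(self, sample):
        prediction = self._reconstruct(sample)
        error = sum((a - b) ** 2 for a, b in zip(sample, prediction)) ** 0.5
        if self.ema_error is None:
            self.ema_error = error
        is_anomaly = error > self.sensitivity * max(self.ema_error, 1e-9)
        # Update the dynamic threshold and model state in an online fashion.
        self.ema_error = self.alpha * error + (1 - self.alpha) * self.ema_error
        self.history.append(sample)
        return is_anomaly

detector = IFTMDetector()
print(detector.process([0.2, 0.1, 0.3]))  # False for normal behavior
```

In ZerOps4E, one such detector instance runs per monitored component and consumes the metric stream produced by the collector.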
Actionsare for example the migration of a virtual machine or recon-figuration of a particular service. C. Operator Component
We extended the bitflow-k8s-operator (https://github.com/bitflow-stream/bitflow-k8s-operator), which orchestrates the distributed deployment of the analysis pipeline. It is responsible for parsing the infrastructure dependency model and scheduling analysis pipelines in the infrastructure. It leverages the Kubernetes concept of custom resource definitions (CRDs) to introduce two custom objects into a Kubernetes cluster: data sources and data analysis steps. These custom resources are stored in Kubernetes and can be manipulated in the same way as regular Kubernetes objects like Pods or Services. Furthermore, the controller watches for updates to those CRDs and continuously ensures the desired state. Thus, it automatically creates and deletes data analysis containers based on the custom resource definitions.
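Conceptually, this reconciliation resembles a watch loop over the custom resources. The sketch below uses the Kubernetes Python client purely for illustration, with hypothetical group and plural names; the actual bitflow-k8s-operator is implemented in Go and also reconciles analysis-step objects.

```python
# Hedged sketch of a reconciliation loop over data source custom resources.
from kubernetes import client, config, watch

def reconcile(event):
    """Placeholder: create, update or delete the analysis Pods for a source."""
    name = event["object"]["metadata"]["name"]
    print(f"{event['type']}: ensure analysis containers for {name}")

def main():
    config.load_incluster_config()  # running inside the cluster
    api = client.CustomObjectsApi()
    for event in watch.Watch().stream(api.list_cluster_custom_object,
                                      group="example.bitflow", version="v1",
                                      plural="datasources"):
        reconcile(event)

if __name__ == "__main__":
    main()
```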
Data Sources: This CRD object represents a data source such as a monitored node, virtual machine or pod in the system. It contains a URL for accessing the data and a number of labels that describe arbitrary properties of the data source.
Analysis Step:
This object defines the analysis workload in the form of a Kubernetes Pod template. It contains all information necessary to start a data analysis container. The first parameter of an analysis step is a list of ingest selectors which describe the data sources that should be consumed and used by the step. These ingest selectors are matched against the labels of existing data source objects in the system. All data sources which match these ingest selectors are considered appropriate for the analysis step. Besides the execution instructions, the analysis step definition also contains initial values for all relevant hyperparameters of the algorithm.

Since we assume that it is not feasible to run every type of analysis step on lightweight edge devices, we extended the operator with a region restriction for analysis steps. In edge computing environments the infrastructure is often divided into several regions, i.e. a public cloud region and several edge regions located anywhere in a smart city [1]. In addition, some of the edge devices may have additional hardware connected, for instance GPUs, which enhances the performance of specific analysis steps. When an analysis step is defined with a region restriction, the operator only uses nodes which comply with that restriction for the scheduling of such analysis steps.
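The matching of ingest selectors against data source labels can be pictured as a simple label comparison. The following sketch is our own simplification, not the operator's actual matching logic, and also includes the region restriction:

```python
# Hedged sketch of ingest-selector and region matching for an analysis step.
def matches(step, data_source, node_region):
    """True if the data source satisfies all ingest selectors of the analysis
    step and the candidate node lies in an allowed region (if restricted)."""
    labels = data_source["labels"]
    selectors_ok = all(labels.get(k) == v for k, v in step["ingest_selectors"].items())
    region_ok = not step.get("regions") or node_region in step["regions"]
    return selectors_ok and region_ok

# Example: an anomaly detection step restricted to the "edge" region.
step = {"ingest_selectors": {"arch": "arm64v8"}, "regions": ["edge"]}
source = {"labels": {"arch": "arm64v8", "node": "edge-node-1"}}
print(matches(step, source, node_region="edge"))  # True
```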
D. Model Repository
The model repository stores the AI models of analysis steps. This is necessary to enable the warm start of, for instance, anomaly detection algorithms, which in turn results in a shorter learning time or even no learning time at all. In addition, models can continuously be updated to adapt to seasonal changes. We are using the Redis in-memory database (https://redis.io/) as a model repository since it shows promising performance in the IoT context [17] and further provides modules for AI and edge computing use cases (https://redislabs.com/redis-enterprise/redis-edge/).
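As an illustration of the warm-start mechanism, a detector state could be stored and restored roughly as follows. This is a sketch assuming plain key/value storage; the key naming, serialization and service name are our own choices, not a fixed ZerOps4E format.

```python
# Hedged sketch: persisting and restoring a detector model in Redis so that a
# restarted analysis step can warm-start instead of learning from scratch.
import pickle
import redis

r = redis.Redis(host="model-repository", port=6379)  # hypothetical service name

def save_model(component_id, model):
    r.set(f"model:{component_id}", pickle.dumps(model))

def load_model(component_id, default_factory):
    raw = r.get(f"model:{component_id}")
    return pickle.loads(raw) if raw else default_factory()
```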
V. PERFORMANCE EXPERIMENTS

Edge devices are typically restricted in terms of processing power, memory and energy consumption [18]. Therefore, we focus on the resource usage overhead of two key technology parts inside ZerOps4E:
• Resource-efficient collection of metrics on edge devices
• Feasibility of deploying multiple anomaly detection models on edge devices
In addition, we compare the resource consumption between edge devices and commodity servers, which function as cloud instances, to show key differences but also to provide guidance for practical usage. For the evaluation, we used the devices depicted in Table I. The results of the commodity server are compared to the performance on typical edge devices, a Raspberry Pi 3B and a Raspberry Pi 4B.

TABLE I: Hardware used for the performance evaluation.

           | Commodity Server                | RPI 3B              | RPI 4B
  CPU      | Intel Xeon E3-1230 V2, 3.30 GHz | Cortex-A53, 1.4 GHz | Cortex-A72, 1.5 GHz
  Memory   |                                 |                     |
  Ethernet |                                 |                     |
A. Experiment Setup
The platform is envisioned to run in heterogeneous environments in terms of CPU architectures, and therefore we have adjusted our continuous integration (CI) pipeline to compile the components for amd64, arm32v7, and arm64v8. We are using Docker to provide the build artifacts and create images for each CPU architecture. At the end of each CI pipeline a Docker manifest is created which acts as a multi-arch Docker image. This enables us to use the same deployment scripts for every node in the infrastructure, regardless of the CPU architecture. Therefore, our platform can rapidly be deployed even on heterogeneous infrastructures.

B. Monitoring Overhead
We evaluated the overhead of the collector with different frequencies while running on the aforementioned edge devices. On all devices the bitflow-collector was deployed within a Docker container and its resource consumption was monitored in order to capture the collector's overhead on the CPU resource consumption. We assume that high-frequency collection of monitoring data provides benefits for data analysis, but we expect limitations in the resource usage overhead, which introduces further noise and limits the number of coexisting service containers. The collection frequency of the monitoring agent was increased in steps of 100 ms over the interval [100 ms, 10 s].

Figure 1 provides insights into the relative memory (a) and CPU (b) overhead on the monitored system. The left diagram shows that the memory consumption overhead is less than 2% for the monitoring of any of the hardware components and does not depend on the collection frequency. This is expected, as the collected metrics are directly transferred to the local data analysis and not stored in memory. In contrast, the CPU utilization increases for high-frequency collection (e.g. 100 ms) compared to collecting metrics every second. This is expected, as the resource overhead of collecting increases with the rate of parsing the relevant APIs. For both memory and CPU utilization, the edge devices consume a higher percentage compared to the commodity server.

The results depicted in Figure 1 show that with a frequency of 500 ms the overhead of the collector amounts to only 2-3% CPU utilization on edge devices. Therefore, it is feasible to deploy the bitflow-collector on lightweight edge devices without interfering with running services in the infrastructure, while providing metrics for analysis at a high frequency.
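For reference, per-process overhead of this kind can be sampled in a few lines. The sketch below uses psutil as an assumed measurement tool; the paper does not specify the tooling used for these measurements.

```python
# Hedged sketch: sampling the CPU and memory overhead of the collector process.
import psutil

def sample_overhead(pid, interval=1.0):
    """Return (cpu_percent, memory_percent) of a process on the host."""
    proc = psutil.Process(pid)
    cpu = proc.cpu_percent(interval=interval)  # percent of one core over the interval
    mem = proc.memory_percent()                # resident memory as percent of total RAM
    return cpu, mem
```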
[Figure 1: panels (a) Memory overhead of monitoring agent and (b) CPU overhead of monitoring agent; average overhead in % over the collection frequency in ms for the commodity server, Raspberry Pi 4B and Raspberry Pi 3B.]
Fig. 1: CPU and memory overhead of the bitflow-collector running on a commodity server and Raspberry Pis with collection frequencies from 100 ms to 10 s.
C. Performance of Anomaly Detection Algorithms
The ARIMA-, BIRCH-, and LSTM-based anomaly detection algorithms were implemented in Java and integrated into the Bitflow framework. We are using Docker as the execution platform, which allows us to limit the resources of running processes through cgroups. For comparison, we performed the following experiment on a commodity server, a Raspberry Pi 3B and a Raspberry Pi 4B.

For each experiment, a data set of 10,000 samples with 28 monitoring metrics was read from a file and the respective algorithms were applied. During execution, we measured the average processing time per sample and its standard deviation. In case of the commodity server, we started with no resource limitation at all (8 virtual CPUs) and decreased the allocated vCPUs by 0.1 for each following run until reaching the minimum assignable resources of 0.1 vCPUs. For the Raspberry Pis, we compiled a custom Raspbian Linux kernel, since the default kernel does not support the cpu.cfs_period_us and cpu.cfs_quota_us cgroups used by the Docker engine for resource limitations. Afterwards, we ran the same experiments starting with 4 vCPUs and decreasing again by 0.1 vCPU until 0.1 vCPUs were reached.

Figure 2 shows the results of the evaluation for the three devices: commodity server (a), Raspberry Pi 4B (b), and Raspberry Pi 3B (c). All three plots show the same general behavior of an exponential increase when limiting the CPU resources towards zero. This is expected as, by limiting the CPU utilization, the algorithms have less computational resources while still performing the same processing of samples. Consequently, the runtime per sample grows. The algorithms applied on the commodity server show the shortest computation times compared to the edge devices, as the commodity server utilizes a more powerful CPU. The results show that BIRCH and ARIMA perform more efficiently than LSTM in terms of time used per sample. This is due to the complexity of the applied algorithms: the anomaly detection algorithms are continuously trained and the deep learning models suffer from a higher computational complexity due to backpropagation during training. Consequently, the deep learning models are expected to have a higher computation time.

The results show that BIRCH and ARIMA are computationally efficient on both the edge devices and the commodity server. In both cases, we expect that, for vCPU limits down to 0.6, the algorithms are capable of being applied in an AIOps use case, as the monitoring rates are expected to be 500 ms or higher. This is also depicted in Figure 3, which shows the combined CPU overhead of the monitoring agent and the anomaly detection algorithms when samples are processed just-in-time, meaning at the same speed at which new samples are provided in the data stream.

Applying deep learning models with continuous training may not suit the needs of real-time processing on edge devices, as the computation time is likely to be higher than the monitoring rate. For such techniques, the anomaly detection should be moved to the cloud. Nevertheless, we showed that deep learning approaches are capable of detecting anomalies with high quality [12]. For complex behavior pattern detection, such approaches might be applied for latency-uncritical edge use cases, while devices with real-time responsibilities should utilize BIRCH or ARIMA.

In conclusion, the limitation of resources can be used to establish SLAs for the self-healing capabilities when applying AI algorithms on edge devices.
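As a reference for the measurement itself, the per-sample timing can be sketched as follows, assuming a generic online detector interface such as the IFTM sketch above; the actual measurements were taken in the Java/Bitflow implementation.

```python
# Hedged sketch: measuring average processing time per sample for an online detector.
import time
import statistics

def measure_per_sample_time(detector, samples):
    durations = []
    for sample in samples:
        start = time.perf_counter()
        detector.process(sample)  # online training and scoring of one sample
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)
```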
Furthermore, anomaly detection based on BIRCH or ARIMA is feasible on edge devices, while LSTM-based anomaly detection should be considered to run in the cloud for potentially higher-quality results. In the future, the quick responses at the edge could be aggregated with higher-quality results from cloud services.

[Figure 2: panels (a) Commodity Server, (b) Raspberry Pi 4B and (c) Raspberry Pi 3B; processing time per sample over the vCPU limit for BIRCH, LSTM and ARIMA.]
Fig. 2: Illustrated processing time per sample under varying resource limitations. Note that the processing time for the Raspberry Pis is depicted in seconds, whereas for the commodity server it is depicted in milliseconds.

[Figure 3: panels (a) CPU overhead (monitoring & ARIMA), (b) CPU overhead (monitoring & BIRCH) and (c) CPU overhead (monitoring & LSTM); average CPU utilization in % over the monitoring frequency in ms for the commodity server, Raspberry Pi 4B and Raspberry Pi 3B.]
Fig. 3: CPU utilization of the monitoring agent and anomaly detection algorithms for monitoring frequencies from 200 ms to 1 s during just-in-time processing of samples.

VI. CONCLUSION

This paper described the framework ZerOps4E, which applies self-healing capabilities to distributed, heterogeneous infrastructures. The architecture of ZerOps4E includes several key components, including resource monitoring and analysis steps, which are scheduled in a distributed manner and are capable of running on low-powered edge devices. We showed the feasibility of deploying high-frequency resource monitoring agents directly on edge devices, as they can be operated resource-efficiently. Besides simple aggregation and filtering methods applied to the monitored data on edge devices, more sophisticated techniques such as anomaly detection might be applied. For such methods, we showed that there exist multiple anomaly detection solutions which are applicable on edge devices, while others, like deep learning models, might not be applicable. For such cases, one has to decide based on further properties like the quality of the anomaly detection approaches and the latency guarantees between possible execution places.
REFERENCES

[1] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge Computing: Vision and Challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[2] A. Gulenko, M. Wallschläger, F. Schmidt, O. Kao, and F. Liu, "A System Architecture for Real-time Anomaly Detection in Large-scale NFV Systems," in Procedia Computer Science, 2016.
[3] W. Yu, F. Liang, X. He, W. G. Hatcher, C. Lu, J. Lin, and X. Yang, "A Survey on the Edge Computing for the Internet of Things," IEEE Access, vol. 6, pp. 6900–6919, 2017.
[4] S. K. Bose, B. Kar, M. Roy, P. K. Gopalakrishnan, and A. Basu, "AdepoS: Anomaly detection based power saving for predictive maintenance using edge computing," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 597–602, 2019.
[5] J. Schneible and A. Lu, "Anomaly detection on the edge," in Proceedings of the IEEE Military Communications Conference (MILCOM), pp. 678–682, 2017.
[6] M. Soualhia, C. Fu, and F. Khomh, "Infrastructure fault detection and prediction in edge cloud environments," in Proceedings of the 4th ACM/IEEE Symposium on Edge Computing (SEC), pp. 222–235, 2019.
[7] R. Kozik, M. Choraś, M. Ficco, and F. Palmieri, "A scalable distributed machine learning approach for attack detection in edge computing environments," Journal of Parallel and Distributed Computing, vol. 119, pp. 18–26, 2018.
[8] L. Lyu, J. Jin, S. Rajasegarar, X. He, and M. Palaniswami, "Fog-empowered anomaly detection in IoT using hyperellipsoidal clustering," IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1174–1184, 2017.
[9] A. Gulenko, A. Acker, F. Schmidt, S. Becker, and O. Kao, "Bitflow: An in situ stream processing framework," IEEE, 2020, pp. 182–187.
[10] P. Bellavista and A. Zanni, "Feasibility of fog computing deployment based on Docker containerization over Raspberry Pi," in ACM International Conference Proceeding Series, 2017.
[11] T. Pfandzelter and D. Bermbach, "IoT data processing in the fog: Functions, streams, or batch processing?" in Proceedings of the IEEE International Conference on Fog Computing (ICFC 2019), pp. 201–206, 2019.
[12] F. Schmidt, A. Gulenko, M. Wallschläger, A. Acker, V. Hennig, F. Liu, and O. Kao, "IFTM - Unsupervised Anomaly Detection for Virtualized Network Function Services," in Proceedings of the 2018 IEEE International Conference on Web Services (ICWS 2018), Part of the 2018 IEEE World Congress on Services, 2018.
[13] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," SIGMOD Record (ACM Special Interest Group on Management of Data), 1996.
[14] F. Schmidt, F. Suri-Payer, A. Gulenko, M. Wallschläger, A. Acker, and O. Kao, "Unsupervised anomaly event detection for VNF service monitoring using multivariate online ARIMA," in Proceedings of the International Conference on Cloud Computing Technology and Science (CloudCom), 2018.
[15] M. Solé, V. Muntés-Mulero, A. I. Rana, and G. Estrada, "Survey on Models and Techniques for Root-Cause Analysis," pp. 1–18, 2017. [Online]. Available: http://arxiv.org/abs/1701.08546
[16] A. Acker, F. Schmidt, A. Gulenko, and O. Kao, "Online density grid pattern analysis to classify anomalies in cloud and NFV systems," IEEE, 2018, pp. 290–295.
[17] L. D. Braulio, E. D. Moreno, D. D. De Macedo, D. Kreutz, and M. A. Dantas, "Towards a Hybrid Storage Architecture for IoT," in Proceedings of the IEEE Symposium on Computers and Communications, pp. 470–473, 2018.
[18] S. Yi, Z. Hao, Z. Qin, and Q. Li, "Fog computing: Platform and applications," in Proceedings of the 3rd Workshop on Hot Topics in Web Systems and Technologies (HotWeb 2015), pp. 73–78, 2016.