Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs
Morgan K. Geldenhuys, Lauritz Thamsen, and Odej Kao
Technische Universität Berlin, Germany, {firstname.lastname}@tu-berlin.de

Abstract—Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered to. Typically, systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing Checkpoint and Rollback Recovery. However, this is an expensive operation which impacts negatively on the overall performance of the system, and manually optimizing fault tolerance for specific jobs is a difficult and time-consuming task. In this paper we introduce Chiron, an approach for automatically optimizing the frequency with which checkpoints are performed in streaming jobs. For any chosen job, parallel profiling runs are performed, each containing a variant of the configurations, with the resulting metrics used to model the impact of checkpoint-based fault tolerance on performance and availability. Understanding these relationships is key to minimizing performance objectives and meeting strict Quality-of-Service constraints. We implemented Chiron prototypically together with Apache Flink and demonstrate its usefulness experimentally.
Index Terms—Distributed Stream Processing, Fault Tolerance, Profiling, Performance Modeling, Quality of Service
I. INTRODUCTION
Distributed Stream Processing (DSP) systems are critical to the processing of vast amounts of data in real-time. It is here where events must traverse a graph of streaming operators to allow for the extraction of valuable information. There are many scenarios where this information is at its most valuable at the time of data arrival and therefore systems must deliver a predictable level of performance. Examples of such scenarios include IoT data processing, click stream analytics, network monitoring, financial fraud detection, spam filtering, news processing, etc. In order to keep up with the ever increasing data processing demands, DSP systems have the ability to scale horizontally across a cluster of commodity nodes to provide additional computation capabilities. However, as DSP systems scale to ever larger sizes, the probability of individual node failures increases and the need for more efficient fault tolerance mechanisms is attracting more attention.

The ability to continue operating in an environment where partial failures are to be expected is a core concern in distributed computing. Checkpoint and Rollback Recovery (CPR) is the most widely used fault tolerance strategy for DSP systems. It improves the overall reliability of streaming jobs executing within them as well as the integrity of results. This can be evidenced by its implementation in many of today's most popular streaming platforms such as Storm [1], Spark [2], and Flink [3]. This strategy involves creating a snapshot of the global state contained within the system and saving it. This is done so that, should a failure occur, the individual worker nodes can be instructed to stop, roll back to the latest save-point, and continue operating without the job failing [4].

However, the CPR process is resource intensive and invariably introduces performance overhead that only grows more significant as requirements increase. This is mainly due to three factors: the replication, transport, and storage of state that needs to be saved at regular intervals; event-time processing; and maintaining any fault tolerance guarantees, i.e. end-to-end exactly-once semantics, while interacting with external systems. For DSP systems which use CPR, selecting an optimal checkpoint interval is key to ensuring high efficiency in streaming applications. Ultimately, there is a trade-off between the performance overhead of regularly saving the global state of the system and the runtime costs of failure recovery [5]. In essence, setting the checkpoint frequency too low risks longer recovery times and therefore weaker availability, while setting the checkpoint frequency too high impacts the overall performance of normal stream processing.

In this paper we address the problem faced by DSP pipeline operators whose jobs must adhere to strict Quality of Service (QoS) constraints and where the scaling out of resources is undesired or infeasible. It answers the question: How can a DSP system be configured so that it provides an optimal end-to-end latency, while ensuring recovery from failures within a given time limit? We propose
Chiron, an automatic profiling and runtime prediction approach which models these relationships and then employs optimization techniques to find good fault tolerance configuration settings for DSP jobs with strict QoS requirements. To the best of our knowledge, our approach is novel in its attempt to model both the performance and availability of DSP jobs through profiling and to recommend an optimal checkpoint interval based on user-defined QoS constraints for improving overall performance. We also contribute to the body of knowledge by presenting an in-depth analysis of the most common failure scenario, a new heuristic for calculating the total time a DSP job will be unavailable, and the results of two experiments conducted in support of our hypothesis.

The remainder of the paper is structured as follows: Section II presents an analysis of the problem; Section III introduces a new heuristic for measuring availability; Section IV presents our approach to profiling, modeling, and optimization; Section V describes our evaluation through experiments; Section VI the related work; and Section VII summarizes our findings.

II. PROBLEM ANALYSIS
In this section we examine what happens when a failure occurs and how a DSP system would respond. This allows for a greater understanding of the CPR process and any considerations which need to be taken into account while attempting to find good configurations. The common case of a worker node failing during execution of a DSP job is used.

Fig. 1: Worker node failure scenario. (The figure plots the ingress rate (msg/sec) over time (ms), relative to the average ingress rate (I_avg) and maximum ingress rate (I_max), annotated with the points i. Checkpoint, ii. Fail, iii. Detect, iv. Restore, v. Maximize, vi. Equalize and the time periods vii. Reprocess Events Time (E), viii. Timeout Interval (T), ix. Recovery Time (R), x. Warm-up Period (W), xi. Catch-up (C).)

A master-worker architecture is assumed for this scenario, processing occurs at event time, and events are consumed from an external source providing exactly-once fault tolerance guarantees. Figure 1 illustrates the scenario of a worker node failing silently, i.e. crashing without sending a notification. It is therefore the responsibility of the system to determine that one of the workers has stopped prematurely, revert the system back into a consistent state, and catch up to the latest event stream offset. We describe events as they would naturally occur, chronologically from left to right (points i. to vi.) along the x-axis. The y-axis represents the ingress rate, i.e. the cumulative frequency at which messages enter the source operators of the dataflow. The points in time are:

i. Checkpoint: The checkpoint process has completed successfully and the distributed snapshot of the global state is saved to disk along with the current event stream offset of the external source. No messages processed prior to this point will need to be reprocessed in the event of a failure.
ii. Fail: At some point before the next checkpoint completes successfully, a worker node fails silently. The system has not yet detected this failure and will attempt to process and perform checkpoints as per normal. Checkpoints will ultimately fail as the consensus between cluster nodes cannot be reached. Events processed before this point but after the last checkpoint will need to be reprocessed.
iii. Detect: At this point, the system has detected the failure. Generally DSP systems use a heartbeat protocol to time out non-responsive nodes. Once the failure has been detected, all running tasks are cancelled and the worker nodes are instructed to roll back to the last checkpoint.
iv. Restore: At this point, all worker nodes have reverted to their previously saved state and processing will begin from the last committed event stream offset. As this offset is further back in time than the current timestamp, the system will attempt to "catch up" and the ingestion rate will rapidly increase up until the maximum processing capacity of the system has been reached.
v. Maximize: At this point, the maximum processing capacity of the system has been reached; events will be processed at this fairly stable rate until the job has caught up. Ingress rates are influenced by the number of available resources.
vi. Equalize: At this point, the DSP job has processed the backlog of accumulated events and the job reaches equilibrium with the average ingress rate. Processing continues as per normal and the extra resources which are no longer necessary are released.

Fault tolerance is an inherently expensive operation which requires resources in order for a DSP job to be able to recover. After checkpointing, a distributed snapshot of the global state is contained at the worker nodes which then needs to be copied over the network to be stored in case it is required later. This is a network intensive operation, and the frequency with which this process is initiated plus the size of each snapshot will have a negative impact on the overall end-to-end latencies. This is especially important to consider in instances where performance requirements are defined in SLAs. When a failure occurs, the DSP system will be unavailable for a period of time. During this downtime, events will accumulate at the external source(s). This includes those which will need to be reprocessed as a result of the rollback. Importantly, the point at which processing resumes should not be considered the point at which the system is once again available, as the backlog of accumulated events will first need to be dealt with. We define
Total Recovery Time (TRT) as the time required for a DSP job to catch up to the latest offset of the incoming event stream from the point at which the failure occurred. It is the metric by which we measure availability in DSP jobs and is reliant on a number of factors. Referring to Figure 1, the TRT is composed of the following time periods:
vii. Reprocess Events Time (E): Represents the time needed to reprocess uncommitted events arriving after the last successful checkpoint completed but before the failure was detected. It is not possible to predict exactly how many events are to be reprocessed nor how long this would take, as it is dependent on knowing the exact timestamp of when the failure would occur after the last successful checkpoint. However, a minimum, median, and maximum number of messages can be estimated based on the time between checkpoints and the average rate of messages being processed per second.

viii. Timeout Interval (T): Represents the time the DSP system will wait before declaring a non-responsive worker node as failed. In most systems this is based on the heartbeat timeout configuration setting.

ix. Recovery Time (R): Represents the length of time the DSP system will take to go from an inconsistent state back into a consistent state after a failure has been detected. Recovery time is influenced by a number of factors, both specific to the DSP job (i.e. snapshot size) as well as the features of the DSP (i.e. differential checkpoints, maintaining local copies of worker state, restart strategies, etc.).

x. Warm-up Period (W): Represents the time it takes for the ingress rate across all source operators to collectively increase from zero to the maximum. This is reliant on a number of factors, primary of which are the read rates of external sources and the maximum capacity of the system.

xi. Catch-up (C): Represents the time the DSP job will take to process the backlog of messages accumulated while the system was unavailable. The DSP system will use the maximum processing capacity available in an attempt to catch up, and the more physical resources that are available, the faster this process will complete. The catch-up time period is directly related to the average ingress rate and the collective time spent in the preceding phases.

Calculating an accurate TRT is not a trivial matter as it is highly dependent on the operational characteristics of the specific DSP job being executed and a number of factors which are not known prior to execution time. However, for jobs where there are defined QoS targets and the system should be available and caught up after a finite amount of time, an estimate is needed for configuring systems well.

III. ESTIMATING THE TOTAL RECOVERY TIME
Building upon the analysis presented in the previous section, we introduce a new heuristic for estimating the minimum, average, and maximum TRT per data point gathered through profiling runs. Referring to Figure 1, the TRT is modeled as a decreasing geometric sequence where the 1st term is composed of the time periods vii through x multiplied by a common ratio U. The first time period E is directly related to the checkpoint interval (CI) configuration. The CI is defined as the frequency with which the checkpoint process is initiated and is measured in milliseconds. Since we cannot predict exactly when the failure will occur between checkpoints, we can take a best, average, and worst case estimate: zero in the best case scenario, the length of the CI divided by two in the average case, and the full length of the CI in the worst case. The second time period T is based on the heartbeat timeout configuration variable which determines how long the system will wait until a node is timed out. In order to calculate the common ratio, a measure of the processing capacity utilization is required. We know that the system will use the maximum processing capacity of the system on top of what is already being utilized to catch up after a failure has occurred. Therefore, the average rate at which messages are processed (I_avg) and the maximum rate at which the system is capable of processing (I_max) can be combined to formulate the processing capacity utilization as a percentage (U) as seen in Equation 1.

$$ U = \frac{I_{avg}}{I_{max}} \qquad (1) $$

Knowing the 1st term and the common ratio allows us to approximate how long it takes to process events arriving while the system was inoperable. However, this does not account for the events which arrived while the system was performing this catch-up, nor the time it takes to catch up on the catch-up, or to catch up on the catch-up of the catch-up, etc. It is for this reason that we model our heuristic as a decreasing geometric series. The formula for this series can be seen in Equation 2.

$$ C(n) = \begin{cases} (E + T + R + W) \cdot U & \text{if } n = 1 \\ C(n-1) \cdot U & \text{if } n > 1 \end{cases} \qquad (2) $$

In order to determine the TRT, the sum of this geometric sequence needs to be calculated. Doing so requires that the n-th term be known. It is important to note that the number of terms in a decreasing geometric sequence will approach infinity as outputs tend to zero and therefore a stopping condition must be established. For our calculations, we recommend choosing the first n resulting in a value less than one. The n-th term can thus be calculated using Equation 3, executed in an iterative loop 1..n.

$$ a_n = (E + T + R + W) \cdot U^{\,n-1} \qquad (3) $$

With the number of terms n, the 1st term of C(n), and the common ratio U, a formula for finding the sum of a geometric sequence can be derived. This formula can be seen in Equation 4. The output will equate to the time taken to catch up to the latest offset of the event stream from the point when the system recovered its ability to start processing events once more.

$$ S_n = \frac{(E + T + R + W)(1 - U^n)}{1 - U} \qquad (4) $$

With S_n known, the TRT can finally be calculated by combining it with the time periods during which the system was in an inconsistent state after the failure, i.e. T and R. The formula for this can be seen in Equation 5.

$$ TRT = T + R + S_n \qquad (5) $$

With the ability to estimate the TRT, we are now able to proceed to modeling the behaviors of the DSP job as a function of the CI.
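To make the heuristic concrete, the following is a minimal Python sketch of Equations 1 to 5. It assumes the profiled quantities T, R, W, I_avg, and I_max are already known, uses the best, average, and worst-case assumptions for E described above (0, CI/2, and the full CI), and sums the series of Equation 2 iteratively rather than applying the closed form of Equation 4; the function and parameter names are illustrative.

```python
def estimate_trt(ci_ms, timeout_ms, recovery_ms, warmup_ms, i_avg, i_max, case="worst"):
    """Estimate the Total Recovery Time (TRT) following Equations 1-5.

    All times are in milliseconds; i_avg and i_max are the average and
    maximum ingress rates. `case` selects the assumption for E, the span
    whose events must be reprocessed: 0 (best), CI/2 (average), or the
    full checkpoint interval (worst).
    """
    u = i_avg / i_max                      # Equation 1: capacity utilization
    if u >= 1.0:
        raise ValueError("I_avg must be below I_max or the job can never catch up")

    e = {"best": 0.0, "average": ci_ms / 2.0, "worst": float(ci_ms)}[case]

    # Iteratively sum the decreasing geometric series of Equation 2,
    # stopping at the first term below one millisecond as recommended.
    s_n = 0.0
    term = (e + timeout_ms + recovery_ms + warmup_ms) * u   # C(1)
    while term >= 1.0:
        s_n += term
        term *= u                                           # C(n) = C(n-1) * U

    return timeout_ms + recovery_ms + s_n                   # Equation 5
```

The worst-case output can then be compared directly against a user-defined recovery-time constraint when assessing a candidate checkpoint interval.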
IV. APPROACH

In DSP systems, configuration has a direct impact both on performance and availability. Yet, determining exactly how much of an impact is hard to ascertain. Regarding CPR fault tolerance, the most important configuration setting to take into consideration is the checkpoint interval (CI). Our approach, Chiron, automatically selects an optimized CI, given a user-defined QoS constraint. This is achieved by following three consecutive steps: 1. Profiling, where we make use of a number of enabling technologies to efficiently gather metrics for DSP jobs executing with different CI configuration settings; 2. Modeling, where we take these metrics and the estimation for the TRT in order to model relationships for both performance and availability in terms of the CI; and 3. Optimization, where we take these models and, based on the user requirements, optimize for the chosen objective. Chiron is intended to be executed either for a short time at the start of a DSP job or periodically to ensure current runtime conditions are taken into account. A streaming job's configuration could, for instance, be validated using Chiron once every six hours, testing a set of configurations on the current input stream for ten minutes. An overview of Chiron can be seen in Figure 2.

Fig. 2: Overview of our approach with Chiron and its interactions with users and systems.
A. Profiling
To meet the performance modeling requirements of DSP jobs, we use a technique for gathering metrics in environments that closely mirror realistic conditions. By employing two key enabling technologies, i.e. OS-virtualization and container orchestration, we are able to replicate multiple pipelines in parallel while ensuring isolation from one another. At the same time, identical copies of the DSP job can be deployed within their own separate self-contained environments while possessing a unique variation of the system configurations. These technologies, when combined with Infrastructure-as-Code processes, provide a mechanism for quickly reproducing any environment. Our previous work, Timon [5], introduced a tool for automatically achieving these goals. To ensure results are comparable and as close to how they would be under realistic conditions, each parallel deployment consumes the same data stream during profiling under normal expected loads. In order to select a good set of input configuration variables for each parallel deployment, the solution space is evenly explored by selecting a minimum and maximum value for the CI, after which a set of equidistant values is calculated between these extremes (a short sketch of this selection follows the metric list below). The following metrics are gathered from each of the parallel deployments:

• Average Ingress Rate (I_avg): Measured in events per second, this value represents the average rate at which events enter the source operators under normal load.

• Maximum Ingress Rate (I_max): Measured in events per second, this value represents the maximum rate at which events can be processed. This value can be estimated by performing load testing.

• Average End-to-End Latency (L_avg): Measured in milliseconds, this value represents the average time taken for an event to traverse the execution graph from the source to the sink operator. Windowing periods are not considered as part of this measurement. Average system performance is determined by this value and is directly affected by the fault tolerance mechanism, i.e. the larger the size and frequency of snapshots, the greater the impact on system resources.

• Average Recovery Time (R_avg): Measured in milliseconds, this value represents the average recovery time as described in the Problem Analysis section. It can be measured by sequentially injecting failures during profiling and averaging the time taken to recover.

• Average Warm-up Time (W_avg): Measured in milliseconds, this value represents the average time taken for the ingress rate to increase from zero to the maximum processing capacity. Like I_max, this value can be estimated by performing load testing.
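As a small illustration of the candidate selection described above, the following sketch generates equidistant CI values between the chosen extremes; the function and parameter names are illustrative, and the example call mirrors the eleven profiling deployments between 1,000 ms and 60,000 ms used later in the evaluation.

```python
def candidate_checkpoint_intervals(ci_min_ms, ci_max_ms, num_deployments):
    """Return evenly spaced CI candidates between the two extremes,
    one per parallel profiling deployment (endpoints included)."""
    if num_deployments < 2:
        return [ci_min_ms]
    step = (ci_max_ms - ci_min_ms) / (num_deployments - 1)
    return [round(ci_min_ms + i * step) for i in range(num_deployments)]

# For example, eleven candidates between 1,000 ms and 60,000 ms.
cis = candidate_checkpoint_intervals(1000, 60000, 11)
```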
B. Modelling

In our approach, we propose the generation of two two-population models: the first for modeling performance as a predictor of L_avg and the second for modeling availability as a predictor of TRT. Concerning performance, data points are taken directly from profiling runs and used for modeling. The function graph P(CI) in Figure 3(a) illustrates our estimate for the relationship between the independent variable CI and the dependent variable L_avg. Predictably, the gradient of the curve flattens as CI increases due to its impact on performance lessening substantially as checkpoint frequency decreases. For availability, the TRT estimate generated by the heuristic can be used to model a family of functions. The function graphs A_min(CI), A_avg(CI), and A_max(CI) in Figure 3(b) illustrate these relationships. Here we can see that for each CI inputted, three corresponding outputs can be returned for the best, average, and worst cases. Logically, lower CI frequencies will have a more profound impact on the TRT than higher ones. With these functions we can proceed to the final optimization.

Fig. 3: Modeling and optimization with availability constraint. (a) CI vs. Avg. End-to-end Latency. (b) CI vs. Total Recovery Time.
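These curves can be fit with standard tooling; the sketch below uses NumPy's polynomial fitting with degree two, matching the second-order regression used for curve fitting in the evaluation (Section V). The function name and arguments are illustrative, and one availability model would be fit per TRT case (minimum, average, maximum).

```python
import numpy as np

def fit_models(ci_values, latency_avg, trt_estimates):
    """Fit second-order (k = 2) polynomial models from profiling data:
    P(CI) predicts L_avg and A(CI) predicts TRT.

    `ci_values` holds the checkpoint intervals of the parallel profiling
    deployments; the other two arrays hold the corresponding median
    L_avg measurements and heuristic TRT estimates.
    """
    ci = np.asarray(ci_values, dtype=float)
    p = np.poly1d(np.polyfit(ci, np.asarray(latency_avg, dtype=float), deg=2))
    a = np.poly1d(np.polyfit(ci, np.asarray(trt_estimates, dtype=float), deg=2))
    return p, a  # callable models: p(ci) -> predicted L_avg, a(ci) -> predicted TRT
```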
C. Optimization
The goal of our approach is to predict a CI which will configure the fault tolerance mechanism in such a way as to optimize for performance while setting an upper bound on the time the system will be unavailable in the event of a failure. As such, the user will be required to provide a QoS constraint C_TRT defining the upper bound on the TRT. With this input and the outputs of the modeling process, all necessary components are at hand to perform the optimization step. The choice of whether to plan for the worst or the average case is left up to the user. The first step in the process is to find the corresponding CI value of the associated C_TRT constraint. Using the inverse of the selected A(CI) function, the constraint can be inputted to find the CI value used for system configuration. The next step is to find the predicted minimum objective using the alternate function and finally return all three values, i.e. CI, C_TRT, and L_avg.
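A minimal sketch of this optimization step is shown below. It assumes the np.poly1d models from the previous sketch and realizes the inverse of the selected A(CI) by solving A(CI) − C_TRT = 0 for its roots within the profiled CI range; this root-based inversion and the function name are assumptions for illustration, not necessarily Chiron's exact implementation.

```python
import numpy as np

def optimize_checkpoint_interval(p_model, a_model, c_trt, ci_min, ci_max):
    """Find the CI at which the selected availability model A(CI) meets
    the user constraint C_TRT, then report the predicted L_avg there.

    `p_model` and `a_model` are np.poly1d models for P(CI) and the
    selected A(CI) (e.g. A_max for worst-case planning).
    """
    # Invert A(CI) = C_TRT by finding real roots of A(CI) - C_TRT
    # that lie inside the profiled CI range.
    roots = (a_model - c_trt).roots
    feasible = [r.real for r in roots
                if abs(r.imag) < 1e-9 and ci_min <= r.real <= ci_max]
    if not feasible:
        raise ValueError("constraint cannot be met within the profiled CI range")

    # The largest such CI keeps TRT within the constraint while minimizing
    # checkpoint overhead, and hence the predicted end-to-end latency.
    ci_opt = max(feasible)
    return ci_opt, c_trt, p_model(ci_opt)   # (CI, C_TRT, predicted L_avg)
```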
V. EVALUATION

Now we show that using Chiron is both practical and beneficial for distributed stream processing through experiments.
A. Experimental Setup
Our experimental setup consisted of a 10-node Apache Kafka cluster [6] and a 50-node Kubernetes [7] and HDFS cluster [8]. Node specifications and software versions are summarized in Table I. A single switch connected all nodes. For both experiments, 11 Flink clusters were instantiated to perform parallel profiling runs across 11 checkpoint interval settings ranging from a minimum of 1000ms to a maximum of 60,000ms. Parallelism for all jobs was set to 24. Each Flink cluster consisted of 1 job master in high-availability mode and 27 workers. All Flink workers were created with 1 task slot and 4GB of memory. A total of 5 profiling runs were conducted for each experiment with the median resulting values being selected for modeling. Prometheus (https://prometheus.io, Accessed: Aug 2020) was used for the gathering of metrics. A user-defined TRT constraint (C_TRT) was configured for each experiment. This constraint is defined as the maximum time the job should require before being available again in the event of a failure.
TABLE I: Cluster Specifications
  OS:       Ubuntu 18.04.3
  CPU:      Quadcore Intel Xeon CPU E3-1230 V2 3.30GHz
  Memory:   16 GB RAM
  Storage:  3TB RAID0 (3x1TB disks, Linux software RAID)
  Network:  1 GBit Ethernet NIC
  Software: Java v1.8, Flink v1.10, Kafka v2.5, ZooKeeper v3.5, Docker v19.03, Kubernetes v1.18, HDFS v2.8, Redis v5.08, Prometheus v2.17, Pumba v0.7

To determine the maximum processing capacity of each job, profiling runs were initiated from an earlier timestamp resulting in approximately 10 minutes of catch-up time. Kafka was configured with 24 partitions per topic and was not a limiting factor in the maximum input throughput at any point. Regarding latencies, averages were taken over the 0.999 percentile in order to filter outliers during normal failure-free operations. Three failures were injected per job execution via the Pumba chaos testing tool and the average time taken for the job to restart was calculated. Actual TRTs were also measured independently in order to validate the outputs of the modeling process. Lastly, second order (k = 2) polynomial linear regression was selected for curve fitting across all models.

B. Streaming Jobs
Two experiments were conducted to evaluate the usefulness of Chiron. Source code for these experiments can be found on GitHub (https://github.com/morgel/IoTDV-experiment, https://github.com/morgel/YSB-experiment). The first was the IoT Delivery Vehicles (IoTDV) Experiment, where a simulator generates 500,000 delivery vehicle events per second which are submitted to Kafka to await processing. Each event contains information about the vehicle's geo-location, speed, and crucially whether or not the vehicle is self-driven. The job consisted of the following streaming operations: read an event from Kafka; deserialize the JSON string; filter update events not within a certain radius of a designated observation geo-point where delivery vehicles are of a particular type, i.e. self-driving; take a 10 second window where all update messages are of the same vehicle ID and calculate the vehicle's average speed; generate an alarm for vehicles which have exceeded the speed limit; enrich the notification with vehicle type information from data stored in system memory and write it back out to Kafka.

The second experiment is based on the Yahoo Streaming Benchmark (YSB, https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at, Accessed: Aug 2020). It implements a simple streaming advertisement job where there are multiple advertising campaigns and multiple advertisements for each campaign. The job consists of the following operations: read an event from Kafka; deserialize the JSON string; filter out irrelevant events (based on type field); take a projection of the relevant fields (ad_id and event_time); join each event by ad_id with its associated campaign_id stored in Redis; take a 10 second windowed count of events per campaign and store each window in Redis along with a timestamp of when the window was last updated. For the purposes of our experiments, we modified the Flink benchmark by enabling checkpointing and replacing the handwritten windowing functionality with the default Flink implementation. Although doing so does decrease update frequency to the length of each window, results should be accurate and more interesting for our experiments due to the accumulated windowing operator state at each node.

Fig. 4: Experimental outputs of modeling process. (a) IoTDV Experiment: P(CI). (b) IoTDV Experiment: A_max(CI), A_avg(CI), and A_min(CI). (c) YSB Experiment: P(CI). (d) YSB Experiment: A_max(CI), A_avg(CI), and A_min(CI). (All panels plot Checkpoint Interval (ms) on the x-axis; (a) and (c) plot Average Latency (ms), (b) and (d) Total Recovery Time (ms) on the y-axis.)

TABLE II: IoTDV Experiment Results.
(a) Coefficient of Determination:
    Model:     P      A_max   A_avg   A_min
    R-Squared: 0.891  0.98    0.934   0.819
(b) Experimental Outputs of Optimization Process:
    CI: 41,581 ms; L_avg: 1447 ms
(c) Error Analysis of Experimental Outputs against Actual Values:
    Observation:        1      2      3      4      5      Predicted
    TRT (s):            120    114    105    105    151    180 (C_TRT)
    C_TRT > TRT:        true   true   true   true   true
    Actual L_avg (ms):  1625   1537   1501   1474   1615   1447
    Percent Error (%):  12.30  6.22   3.73   1.87   11.61

TABLE III: YSB Experiment Results.
(a) Coefficient of Determination:
    Model:     P      A_max   A_avg   A_min
    R-Squared: 0.942  0.996   0.989   0.861
(b) Experimental Outputs of Optimization Process:
    CI: 35,195 ms; L_avg: 826 ms
(c) Error Analysis of Experimental Outputs against Actual Values:
    Observation:        1      2      3      4      5      Predicted
    TRT (s):            106    107    105    130    105    150 (C_TRT)
    C_TRT > TRT:        true   true   true   true   true
    Actual L_avg (ms):  906    917    866    871    957    826
    Percent Error (%):  9.69   9.92   4.62   5.17   13.69

C. Experimental Results

A C_TRT of 180s and 150s was selected for the IoTDV Experiment and YSB Experiment, respectively. Based on the metrics gathered during profiling runs, 4 functions were generated as part of the modeling process for each experiment. These included: P(CI) for performance modeling as illustrated in Figures 4(a) and 4(c); and the family of functions A_max(CI), A_avg(CI), and A_min(CI) for availability modeling as illustrated in Figures 4(b) and 4(d). R-squared values for these functions can be seen in Tables II(a) and III(a). For these experiments, A_max(CI) was used for availability calculations. As the goal of these experiments was to predict a L_avg and CI based on the user-defined C_TRT constraint, Tables II(b) and III(b) detail the outputs of the optimization process. Here we can see for the IoTDV Experiment a CI of 41s was predicted with an L_avg of 1447ms. Regarding the YSB Experiment, a CI of 35s was predicted with an L_avg of 826ms. In order to evaluate the accuracy of our approach, we performed an error analysis of the experimental values against actual observations. As part of the evaluation process, each experiment was executed 5 times with the predicted CI configuration settings and metrics were recorded. Tables II(c) and III(c) tabulate the results of the error analysis performed for both experiments. Here we can see that all experimental values fall within 15% of the actual observed values for the L_avg and all measured TRT values were less than the C_TRT. This indicates that this approach is able to closely predict actual runtime conditions. Additionally, in order to verify the accuracy of the TRT heuristic, the total time between when the failure was injected and the point when all accumulated lag on the source operators had been caught up was measured independently during profiling runs. The red X-marks on Figures 4(b) and 4(d) represent the median values for these observations. For both graphs, the majority of observations were within the range between the minimum and maximum estimates and, as median values, are plotted in close proximity to A_avg(CI). This is a good indication that the heuristic is able to predict that further observations will fall between these ranges.

VI. RELATED WORK
Work most related to our own includes research aimed at adaptive checkpointing. Multi-level checkpointing has been proposed to resolve the issue of checkpoint/recovery overhead [9]–[14]. This allows the use of different checkpointing levels, making it more flexible than traditional single-level approaches which are usually limited to one storage type. It considers multiple failure types, each having a different checkpoint and restart cost associated. For instance, [15] proposes a two-level checkpointing model whereby checkpoint level 1 deals with errors with low checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals with hardware crashes such as node failures. However, these approaches are specific to HPC environments and need to be adapted before use in DSP systems. Other approaches have been proposed which optimize the configuration parameters of this mechanism by finding an optimal CI to improve performance. Some focus on determining the mean time to failure (MTTF) of cluster nodes and then adaptively fitting a CI which minimizes the time lost due to failure [16]–[18]. However, these approaches rely on jobs having a finite execution time as part of their calculations, which is more appropriate to HPC clusters and batch processing workloads than stream processing jobs where completion times are unbounded. A more recent approach specific to DSP systems likewise incorporates failure rates in an attempt to fit a CI based on the MTTF [19]. Our work differs in that Chiron follows a profiling approach intended to be executed either for a short time at the start of a DSP job or periodically, as opposed to continuous execution for runtime optimizations. Additionally, we focus on performance and availability modeling rather than MTTF.

VII. CONCLUSION
Chiron models the performance and availability behaviors of DSP jobs based on metrics gathered from profiling runs. It does so in order to predict an optimal CI configuration setting which, based on a user-defined maximum total recovery time QoS constraint, will minimize the performance impact of enabling fault recovery. Profiling runs are conducted in parallel with the aid of two key enabling technologies, i.e. OS-virtualization and container orchestration. Chiron introduces a new heuristic for predicting the minimum, average, and maximum total recovery time, a measure of the total time from the point when a failure occurs to the point when the job has once again caught up to the head of the incoming event stream. Our approach is aimed at use cases where the data stream is processed at event time and characterized by a fairly stable average input throughput. Although some variance in this rate is to be expected, use cases where the number of events entering the source operators is essentially random or fluctuates wildly over time are not conducive to finding optimal configurations through profiling runs. Future work may focus on using the metrics gathered during profiling to optimize the fault tolerance mechanism of running analytics pipelines.

ACKNOWLEDGMENT
This work has been supported through grants by the German Ministry for Education and Research (BMBF) as BIFOLD (funding mark 01IS18025A) and WaterGridSense 4.0 (funding mark 02WIK1475D).
REFERENCES
[1] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. A. Bhagat, S. Mittal, and D. Ryaboy, "Storm@twitter," Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014.
[2] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in HotCloud, 2010.
[3] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, "Apache Flink: Stream and batch processing in a single engine," IEEE Data Eng. Bull., vol. 38, pp. 28–38, 2015.
[4] R. Koo and S. Toueg, "Checkpointing and rollback-recovery for distributed systems," IEEE Transactions on Software Engineering, vol. SE-13, pp. 23–31, 1987.
[5] M. Geldenhuys, L. Thamsen, K. Gontarska, F. Lorenz, and O. Kao, "Effectively testing system configurations of critical IoT analytics pipelines," pp. 4157–4162, 2019.
[6] J. Kreps, "Kafka: a distributed messaging system for log processing," 2011.
[7] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at Google with Borg," Proceedings of the Tenth European Conference on Computer Systems, 2015.
[8] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache Hadoop YARN: yet another resource negotiator," Proceedings of the 4th annual Symposium on Cloud Computing, 2013.
[9] L. Bautista-Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka, "Low-overhead diskless checkpoint for hybrid computing systems," pp. 1–10, 2010.
[10] L. Bautista-Gomez, N. Maruyama, F. Cappello, and S. Matsuoka, "Distributed diskless checkpoint for large scale systems," pp. 63–72, 2010.
[11] H. Li, L. Pang, and Z. Wang, "Two-level incremental checkpoint recovery scheme for reducing system total overheads," PLoS ONE, vol. 9, 2014.
[12] A. Moody, G. Bronevetsky, K. Mohror, and B. Supinski, "Design, modeling, and evaluation of a scalable multi-level checkpointing system," pp. 1–11, 2010.
[13] L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka, "FTI: High performance fault tolerance interface for hybrid systems," pp. 1–12, 2011.
[14] A. Kulkarni, A. Manzanares, L. Ionkov, M. Lang, and A. Lumsdaine, "The design and implementation of a multi-level content-addressable checkpoint file system," pp. 1–10, 2012.
[15] S. Di, Y. Robert, F. Vivien, and F. Cappello, "Toward an optimal online checkpoint solution under a two-level HPC checkpoint model," IEEE Transactions on Parallel and Distributed Systems, vol. 28, pp. 244–259, 2017.
[16] J. W. Young, "A first order approximation to the optimum checkpoint interval," Commun. ACM, vol. 17, pp. 530–531, 1974.
[17] J. Daly, "A model for predicting the optimum checkpoint interval for restart dumps," in International Conference on Computational Science, 2003.
[18] ——, "A higher order estimate of the optimum checkpoint interval for restart dumps," Future Gener. Comput. Syst., vol. 22, pp. 303–312, 2006.
[19] S. Jayasekara, A. Harwood, and S. Karunasekera, "A utilization model for optimization of checkpoint intervals in distributed stream processing systems,"