Aion: Better Late than Never in Event-Time Streams
Sérgio Esteves, Gianmarco De Francisci Morales, Rodrigo Rodrigues, Marco Serafini, Luís Veiga
Sérgio Esteves
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
Rodrigo Rodrigues
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
Luís Veiga
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
Gianmarco De Francisci Morales
Qatar Computing Research Institute, Qatar
Marco Serafini
College of Information and Computer Sciences, University of Massachusetts Amherst, USA
Abstract
Processing data streams in near real-time is an increasingly important task. In the case of event-timestamped data, the stream processing system must promptly handle late events that arrive after the corresponding window has been processed. To enable this late processing, the window state must be maintained for a long period of time. However, current systems maintain this state in memory, which either imposes a maximum period of tolerated lateness, or causes the system to degrade performance or even crash when the system memory runs out.

In this paper, we propose AION, a comprehensive solution for handling late events in an efficient manner, implemented on top of Flink. In designing AION, we go beyond a naive solution that transfers state between memory and persistent storage on demand. In particular, we introduce a proactive caching scheme, where we leverage the semantics of stream processing to anticipate the need for bringing data to memory. Furthermore, we propose a predictive cleanup scheme to permanently discard window state based on the likelihood of receiving more late events, to prevent storage consumption from growing without bounds.

Our evaluation shows that AION is capable of maintaining sustainable levels of memory utilization while still preserving high throughput, low latency, and low staleness.

∗ Aion (Αιων): Greek god of eternity, personifying unbounded time.
Introduction

Stream Processing Systems (SPS) are increasingly employed to extract insights and value from continuous streams of data in near real-time. Examples of this class of systems include Storm [5], Spark Streaming [4], Samza [3], Apex [2], Google Cloud Dataflow [7], and Flink [15]. In many jobs handled by SPSs, each record in the stream represents a specific event, which is associated with an event time, e.g., a user clicked on an ad at a certain time. In these cases, events may arrive out of the order in which they were generated; they are typically aggregated in event-time windows (e.g., all clicks generated in the last hour), and subsequently processed by the SPS at a given processing time.

The challenge with this model is that some events may experience large delays between generation and processing times. This can happen for a variety of reasons, such as network congestion, partitions, failures, configuration errors, or transient connections on the device generating the event. These delays can prevent events from arriving in time to be processed in their pertaining windows.

The way that existing SPSs handle this case can be split into two categories. Some systems handle this by simply dropping late events (i.e., load shedding [27]). However, dropping events is not acceptable in mission- or business-critical applications that rely on complete result sets (e.g., fraud detection, traffic monitoring, or intensive care units). For example, Google's Photon system [11] is used for ad billing, and, as described, each time an ad click is permanently ignored due to delays, money is actually lost. Hence, Google needs to set a very large threshold for ignoring late events (of the order of days), making such an occurrence virtually impossible [11]. Similarly, applications that log financial transactions may be forced to ensure that all events are incorporated in a given computation irrespective of their arrival time, in order to meet accounting and legal constraints.

Alternatively, other SPSs, such as Flink, allow late events to be aggregated in an expired window for an extended period of time. However, for applications with operators whose state increases monotonically with the ingested data (namely most user-defined functions), maintaining concurrent windows for considerably long periods of time can create large memory pressure. In fact, it has been shown that SPSs are not equipped to deal with an unbounded space cost: they start thrashing with OS paging, perform excessive JVM garbage collection, or simply crash when they run out of memory [20].

In this paper, we propose AION, a comprehensive solution to handle late arriving events in stream processing. AION is capable of managing window state across memory and persistent storage (e.g., HDD, SSD, NAS), while maintaining low latency and sustainable memory utilization.

Designing AION required addressing several research challenges: how to manage state across disk to alleviate memory pressure while not introducing major penalties in the processing rate due to I/O; for how long the state of a window should be maintained by the SPS; and how to update and refine results in a timely and resource-efficient manner. AION tackles these challenges by introducing several key techniques that leverage the semantics of stream processing in order to improve the management of data across memory and persistent storage.

The first technique is proactive caching, which treats main memory as a cache for the window state, which is otherwise offloaded to persistent storage. The main insight of proactive caching is that the semantics of SPSs allow the system to predict that processing is more likely to be necessary at specific times, for example when a time window expires. This enables a proactive approach, where I/O is regulated by a central scheduler, which tries to evict data ahead of time, thus minimizing the performance penalty of offloading to persistent storage in terms of both latency and throughput.

The second technique introduced by AION is predictive cleanup. This uses a past history of the distribution of late event arrival times to predict the best time to purge the state of a windowed operator completely, based on its likelihood of receiving more events; i.e., the state can be purged when we do not expect to receive more events (or, alternatively, less than a given fraction of events for that window) within a chosen confidence interval.

Finally, we also address the issue of updating late results. For past windows, it is desirable to amend previously emitted results as soon as late events arrive. However, recomputing a monotonic window (whose state increases with ingested data) for each received event is computationally expensive. To address this, we provide a trigger that is able to find a good compromise between staleness of the result and resource usage (or number of executions), thereby identifying the adequate times for recomputing a past window.

We implemented AION by extending the codebase of Apache Flink, a widely used distributed SPS. We evaluate AION using benchmarks and practical applications. Experimental results indicate that AION is capable of handling large amounts of lateness, well beyond the limit where current SPSs run out of memory and crash, maintaining sustainable levels of memory utilization while still preserving high throughput, low latency, and low staleness.

The remainder of the paper presents background on event-time stream processing, followed by the design, implementation, and experimental evaluation of AION.

Background

Before delving into the technical details of our system, we review a few key concepts related to the computational semantics of SPSs supporting event-time processing.

Streaming applications are commonly represented in the form of directed graphs that represent the data flow of the application. The vertices of the graph are data transformations (operators), and its edges are channels that route data between operators. The data flowing along these edges is a stream, represented as a sequence of events, each associated with a key and a timestamp. The key is specified by the application, and is data-dependent (e.g., an ad identifier). To achieve high throughput, modern distributed engines leverage data parallelism by creating several instances of an operator that process independent sub-streams.

An SPS reads data from one or more sources. The rate at which the data is read is called ingestion rate, whereas the rate at which an operator processes data is called processing rate. For an SPS deployment to be sustainable, it needs to offer a processing rate that can cope with the ingestion rate, at least on average over time.
Time domains.
An important component of the abstraction provided by the operators is that events are associated with a timestamp. For assigning these timestamps, three different notions of time have been considered: processing-time, ingestion-time, and event-time [24]. With processing-time, each operator assigns a timestamp to an event independently, based on the current system clock time when it processes the event. Ingestion-time refers to the time when events enter the system, and is assigned to an event by the first operator that reads it from the data source. Finally, event-time is associated to an event outside the SPS, when it is generated. Event-time enables out-of-order streams of data events to be grouped and ordered by their timestamps, hence giving consistent results that are robust to delays (i.e., the result of the computation is the same irrespective of the order in which events are processed) or the mode of operation of the system [10].

Windowing.
A window groups events in time, allowing an infinite stream to be processed in finite batches [17]. A single event can be part of zero, one, or many windows, according to the user-specified semantics. Common types of windows include: Tumbling, a fixed-size window with no overlap with other windows (e.g., to compute hourly aggregates); Sliding, a fixed-size window that slides by some amount (e.g., to compute hourly aggregates every 10 minutes); Session, a dynamically sized window, which represents a consecutive, data-dependent portion of the stream, usually defined per key (e.g., a group of events separated in time by no more than a defined gap constant); and Count, a window that groups a fixed number of consecutive events, irrespective of their timestamps.
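As a concrete illustration of these semantics, the assignment of an event timestamp to tumbling and sliding windows can be sketched in a few lines (an illustrative Python model, not the API of any particular SPS):

```python
def tumbling_window(ts, size):
    """Return the [start, end) tumbling window containing timestamp ts."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return all [start, end) sliding windows containing timestamp ts."""
    windows = []
    start = (ts // slide) * slide      # latest window start that is <= ts
    while start > ts - size:           # earlier windows no longer cover ts
        windows.append((start, start + size))
        start -= slide
    return list(reversed(windows))
```

For instance, with an hourly window sliding every 10 minutes (size 3600, slide 600), each event belongs to six overlapping windows.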
Watermarks.
When using event-time, a watermark signals the time when the system assumes that all events up to a certain event timestamp t have arrived at an operator [16]. For example, a watermark can signal that (ideally) all events in a given window have been received. A watermark is always a best guess: events with a timestamp lower than the watermark timestamp t may still be received, and are considered late. Late data may simply be dropped, or, in case the SPS can handle lateness, incorporated into the state of a window that has already been processed. In the latter case, several semantics are possible, depending on the requirements of the application. For instance, given an operator that computes the average of the values in a window, a late event might trigger the window to emit the new average incrementally, by simply keeping track of the sum and the number of items, and updating these upon receiving late events. However, the case of non-linear functions such as percentiles or arbitrary UDFs is particularly complex: the whole state of the old window needs to be maintained in order to allow late events, and the whole computation needs to be re-executed. One of the design goals of AION is to be generic, thus handling such operators.

There are two main types of watermarks: periodic and punctuated. Periodic watermarks are emitted based on either processing time (every p seconds) or stream elements (every p events). In turn, punctuated watermarks are emitted based on conditions inferred from the data when a particular event arrives. For instance, a source might emit a watermark when an explicit flush event arrives.

The generation of a watermark involves a delicate trade-off. If it is emitted in a conservative way, the system might wait longer than actually needed to process events, thus increasing latency. Conversely, if the watermark is emitted too eagerly, a large fraction of events will become late, thus adding overhead to the computation. Issuing watermarks is often based on a heuristic, since it is in general impossible to tell when all events belonging to a window have arrived [21].

Triggering.
A trigger is a mechanism that determines when a windowed operation should be executed, i.e., when to compute the value of the function over the data in the window. By default, a window is triggered when its watermark is emitted, but it can also be triggered at other times, using different policies, e.g., percentile-based (when some percentage of the data has been accumulated), data-based (counts, punctuation, pattern matching), or even via external signals. In addition, for late events, a window is also triggered when the system time has reached the watermark plus the maximum allowed lateness, which means that no further late events for the window are accepted, and the function result is final.
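The interplay between the watermark, the maximum allowed lateness, and the fate of an event for a given window can be summarized with a small classifier (illustrative Python; the function and parameter names are ours):

```python
def classify(window_end, watermark, max_lateness):
    """Classify an event for a window ending at event time window_end,
    given the current watermark and the maximum allowed lateness."""
    if watermark < window_end:
        return "on-time"    # window has not been triggered yet
    if watermark < window_end + max_lateness:
        return "late"       # incorporated; window is re-triggered
    return "dropped"        # past maximum allowed lateness; result is final
```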
Operator State.
The discussion so far applies to any modern SPS. However, in terms of the operator semantics and the state they are able to maintain, there is no generally accepted API. Therefore, we focus on the system that we use as a base for our implementation, which is Apache Flink.

State is used in stateful operators, which need to retain some memory of the events that were previously processed (e.g., counts for aggregates, parts of the streams for pattern matching, model parameters for machine learning). Flink uses a managed state API, by which operators can access a set of standard state prototypes, usually one per key:

• ValueState, a single value that can be retrieved and updated, e.g., a boolean indicating if an event with the same key has been received in the current window;
• ReducingState/FoldingState, a single value that represents an aggregate of the processed sub-stream, computed via a reduce or fold function, e.g., a per-key sum of the events in the window;
• ListState, a list of elements, which can be iterated and appended to, usually containing the events in the window.

Of the three state prototypes, ListState is the one used by default in custom operators, as it is the most general. However, it is also the most expensive in terms of memory, which can cause heavy pressure when the system needs to maintain a large number of active windows (because of a conservative watermark, or a large maximum allowed lateness).

Each operator can declare several state elements, and Flink manages their distribution and lifecycle (checkpointing and restoring). For each window processed by an operator, Flink maintains a separate instance of the operator state.

The default state backend of Flink stores the state in memory. When the maximum allowed lateness for an operator is large, the number of windows to maintain can grow considerably, thus exerting pressure on the main memory. When designing AION, our goal is to make judicious use of persistent storage to limit main memory usage, and thus alleviate this pressure, without sacrificing throughput or latency.
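The behavior of the three state prototypes can be mimicked with a minimal Python emulation (hypothetical classes for illustration only; Flink's actual managed state API differs):

```python
class ValueState:
    """A single updatable value, e.g., 'has this key been seen?'."""
    def __init__(self):
        self.value = None
    def update(self, v):
        self.value = v

class ReducingState:
    """A single aggregate maintained via a reduce function, e.g., a sum."""
    def __init__(self, reduce_fn):
        self.reduce_fn, self.value = reduce_fn, None
    def add(self, v):
        self.value = v if self.value is None else self.reduce_fn(self.value, v)

class ListState:
    """All events of the window; the most general and most memory-hungry."""
    def __init__(self):
        self.items = []
    def add(self, v):
        self.items.append(v)
    def get(self):
        return iter(self.items)
```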
AION Design

AION provides mechanisms to handle late events by managing state data across both memory and persistent storage. In particular, our goal is to achieve the best of both worlds by (i) preserving the performance benefits of in-memory processing, while (ii) providing significantly more space to maintain state data across both main memory and persistent storage.

The way AION is able to circumvent both memory size limitations and persistent storage latency is by taking advantage of the semantics of event-time stream processing, in order to perform proactive management of past windows. In the remainder of this section, we outline how AION achieves these goals.

Figure 1: AION maintains the window state in both memory (m-bucket) and persistent storage (p-bucket). Different shades of gray represent how events are aggregated into windows.

AION splits the state of each window into two logical containers, called memory bucket and persistent bucket (abbreviated as m-bucket and p-bucket, respectively), as shown in Figure 1. The m-bucket resides in memory and has a limited maximum size; the p-bucket is in persistent storage and is only bounded by the total persistent storage size, which may be considerably larger (e.g., terabytes). AION keeps latency low by using proactive caching, which populates the m-bucket with events from the p-bucket, such that, in most cases, accessing in-memory data can be done without blocking on I/O. In particular, this technique consists of transferring data between m-buckets and p-buckets in a way that is decoupled from the process of feeding window operators. In other words, window operators always access m-buckets, and the transfer of data from the m-bucket to the p-bucket and back is asynchronous. This asynchrony then allows us to define flexible strategies for scheduling I/O, according to one of the policies that we explain next.
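The split between the two buckets can be sketched as follows (an illustrative Python model of the data structure; in the real system the p-bucket lives on disk and transfers happen asynchronously):

```python
class WindowState:
    """Window state split across a bounded in-memory bucket (m-bucket)
    and an unbounded persistent bucket (p-bucket)."""
    def __init__(self, m_capacity):
        self.m_capacity = m_capacity
        self.m_bucket = []   # in memory, bounded
        self.p_bucket = []   # stands in for persistent storage

    def add(self, event):
        if len(self.m_bucket) < self.m_capacity:
            self.m_bucket.append(event)
        else:
            self.p_bucket.append(event)   # redirect once the m-bucket is full

    def destage(self):
        """On window expiry: move all in-memory state to persistent storage."""
        self.p_bucket.extend(self.m_bucket)
        self.m_bucket.clear()

    def stage(self, n):
        """Bring up to n events back to memory ahead of a re-execution."""
        batch, self.p_bucket = self.p_bucket[:n], self.p_bucket[n:]
        self.m_bucket.extend(batch)
```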
The choice of the timing of data transfers between the m-bucket and the p-bucket takes into consideration the semantics of the streaming application. In particular, there are four situations that AION needs to consider: (i) populating the window state for the first time; (ii) executing operators upon triggering; (iii) dealing with watermarks and integrating late events into past windows; and (iv) computing final results. We start by describing the standard policy for each of these situations, and then discuss several alternative policies.

Standard policy.
When populating the state of window w, events are initially stored in the m-bucket of w. When the m-bucket becomes full, AION redirects new events directly to the p-bucket. Subsequently, when w is triggered for execution, AION executes the window operator by fetching all data from the associated m-bucket. At the same time, AION transfers data from the p-bucket to the m-bucket in the background, a process called staging. Reading from the m-bucket while staging from the p-bucket allows us to mask the I/O latency.

Eventually, the watermark reaches the end of w, which makes it expire (i.e., it becomes a past window). At this point, a destaging operation takes place so that all data in the m-bucket is transferred to the p-bucket, thus releasing a significant amount of memory. Subsequent arriving (late) events for w are written directly to the p-bucket.

When a late event arrives, the window is scheduled for re-execution. However, the re-execution of a late window has low priority, to avoid interfering with the execution of current windows, since these produce the most up-to-date results that should be immediately displayed to the user.

The key to reducing the I/O overhead associated with staging is to prestage state to the m-bucket before the re-execution occurs. To this end, we employ proactive caching, which estimates an appropriate time to start prestaging by anticipating when the operator will re-execute. This is achieved by taking into account the semantics of the different types of watermarks. In particular, assessing the re-execution time with periodic watermarks is trivial, since we have knowledge of the period of watermark generation and of the current logical time. In this case, during the first late execution for the window w, pre-staging starts pessimistically when the window immediately preceding w fully expires (including maximum allowed lateness). During this process we assess the overall time taken (∆t), weighted by the number of staged events. Then, for subsequent re-executions of w, we start pre-staging ∆t time before the operator re-execution time. For the case of punctuated watermarks, pre-staging for a window can start as soon as a late event for that window is received, since it indicates an upcoming re-execution, which may be delayed until pre-staging concludes.
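The lead-time estimate for pre-staging under periodic watermarks can be sketched as follows (illustrative Python; the function and parameter names are ours, assuming a roughly constant staging rate):

```python
def prestage_start(next_reexecution_time, staged_events, last_staging_seconds,
                   pending_events):
    """Estimate when to begin pre-staging so that the window state is back
    in memory before the operator re-executes."""
    rate = staged_events / last_staging_seconds   # events/s observed last time
    lead = pending_events / rate                  # expected staging duration
    return next_reexecution_time - lead
```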
In both cases, the m-bucket of the past window is freed after re-execution.

Additional Policies.
To be more flexible and extensible, AION's design allows for defining additional policies. They can be categorized as either local, when they do not take into account the overall system memory utilization, or global, when they regard the system as a whole when optimizing memory.

To demonstrate the flexibility of our design, we provide a few illustrative examples, starting with local policies:

• When a watermark arrives, and if late events are allowed, destage the window state except for a (small) fraction ρmin of initial events, which act as a bootstrap set for later re-staging the window.
• When more than τ processing-time elapses (e.g., a multiple of the median window processing time) without the window either getting new events or a watermark, destage the window state except the ρmin set.

Global policies, in turn, can include the following:

• When the available memory µ is moderately scarce, successively destage window state to disk (except the ρmin bootstrap set) in a selective way, e.g., either by descending order of individual window state size (for faster savings), or by increasing values of ingestion rate of individual windows (to minimize window processing delay).
• When the available memory µ is very scarce (e.g., below a threshold of 10% of physical memory), destage the state of all windows to disk except their ρmin set.

The size of the m-buckets depends on the type of computation to be executed.
Blocking window operators need to consume the entire input before starting the main processing task. For example, when applying an FFT over a sliding window, the entire input data needs to be fetched before processing can start. Non-blocking window operators are able to perform the main processing task while events are fetched one by one. This is the case, for example, when computing n-grams over a stream of words sorted by event time.

For blocking operators, the size of the m-bucket should be equal to the size of the entire window input, otherwise the computation is affected by I/O latency. For non-blocking computations, the size of the m-bucket only matters, in terms of overall computation performance, when staging the events from the p-bucket takes longer than processing the events initially present in the m-bucket (thus no longer masking I/O latency). This constraint is driven by the relative sizes of the buckets and the relative speeds of staging and processing the events for a given computation. In summary, we aim for an m-bucket size that is large enough so that the function never has to wait for events that are still in the p-bucket.
Predictive Cleanup.

Ideally, AION should be able to handle unbounded lateness. However, not only is persistent storage limited, but also the usefulness of windowed data becomes residual over large periods of time. Therefore, AION incorporates a predictive cleanup mechanism for purging window state completely from the system, when that state is considered very unlikely to be needed.

The amount of elapsed time after which predictive cleanup is performed is updated in an adaptive way. To this end, the system continuously observes the distribution of late events (including late events that arrive beyond the maximum lateness bound). The idea is then to start with a conservatively large lateness bound, and, after a representative history of observations is collected, adjust this bound at runtime for newly created windows in a way that is estimated to cover a specified percentage of the events (e.g., 99%) within a certain confidence interval. We keep updating this distribution according to new observations, so that this estimate is as accurate and up-to-date as possible.

Before this maximum bound of allowed lateness expires, it is desirable to update previously emitted results once they become significantly inaccurate due to the arrival of new events. However, computing a window for each newly arrived late event can be very costly in terms of system resource usage. One possible solution to this problem would be to update this computation periodically; however, this can lead to unnecessary executions when the number of new events since the last execution is small or nonexistent. Conversely, during a spike of late events, the computed value might be significantly out-of-date for non-negligible periods.

To address this, we introduce a new trigger that operates according to staleness. We define staleness, between pairs of consecutive executions, as st = (t × n) / (T × N), where t and n are the time elapsed and the number of events accumulated since the last execution, respectively; T and N are the maximum possible time (i.e., maximum allowed lateness) and accumulated events (i.e., total number of late events expected), respectively.
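The staleness definition and the resulting trigger condition can be sketched directly (illustrative Python; names are ours):

```python
def staleness(t_elapsed, n_new, T_max, N_expected):
    """Staleness accumulated since the last re-execution:
    st = (t * n) / (T * N), normalized to [0, 1]."""
    return (t_elapsed * n_new) / (T_max * N_expected)

def should_reexecute(t_elapsed, n_new, T_max, N_expected, bound):
    """Fire the trigger once the staleness bound would be violated."""
    return staleness(t_elapsed, n_new, T_max, N_expected) >= bound
```

Note that staleness grows only when both time passes and new late events accumulate, which avoids both unnecessary periodic executions and long out-of-date periods during spikes.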
Staleness bounds can be user-defined, according to a specified SLA, e.g., a bound on the most outdated result users can tolerate.

Based on the distribution of late arrivals, our trigger determines the minimum number of executions necessary to comply with the maximum staleness bound. To this end, we assess the staleness for each instant of time (or period of time, if there are too many instants) and place an execution at the time that violates the bound; we iteratively repeat this process until we reach the maximum allowed lateness. Due to the irregular nature of the distribution, it is likely that the staleness of the last pair of executions is smaller than all the others, meaning that the maximum staleness that we obtain, in any pair of executions, could be lower for the same number of executions.

To minimize and balance staleness across pairs of consecutive executions, we apply an optimization algorithm (a variation of gradient descent [23]). It minimizes the maximum staleness and returns the instants of time at which we should re-execute the window. It starts with an arbitrary configuration of execution times (to speed up convergence, we make the starting execution times correspond to the places where the distribution of late arrivals has higher relative density). Afterwards, it adjusts the execution times based on the negative gradient of staleness, in order of time. We repeat this process until the standard deviation of all staleness values is very close to zero (i.e., staleness is balanced across pairs of executions and the maximum is already at the minimum), or until a maximum number of iterations is reached, so that we can bound the time spent in this process. Due to the strategic placement of the first execution times, we found that our algorithm converges very quickly (in less than a second) to the minimum value of maximum staleness, and never reached our limit of iterations.

Overall, our trigger minimizes staleness with the minimum number of executions necessary to achieve the specified staleness bounds.

Implementation

We implemented AION as a state backend in Apache Flink version 1.1.1. Our source code is publicly available [6]. We are currently engaging in transferring this technology to the Flink codebase, and have consequently opened an issue in the Flink issue tracker. Next, we describe the implementation of the key aspects of our state backend.
Transparency to applications.
To make use of our Flink backend, applications only need to specify an option in the stream environment configuration.
I/O Scheduling and Priorities.
Using m-buckets and p-buckets decouples the process of feeding window operators from the I/O activity with persistent storage. Destaging data is carried out in the background with low priority, so as not to impact the performance of other operators. In contrast, staging should have maximum priority, since data to be fetched from the p-bucket is required immediately by the window operator that is executing.

A challenge in this context is that both staging and destaging are I/O intensive operations and can interfere with one another. Moreover, there are events being written simultaneously to destaged windows, which also causes I/O activity. To prevent I/O contention, we resort to a single thread whose sole responsibility is to serialize and prioritize requests, and to perform all I/O-related operations on persistent storage. This thread assigns different priorities to different operations, according to their potential impact on performance: pre-staging has maximum priority, followed by writing late events, and then destaging.

Although uncommon for sustainable workloads, these operations (namely destaging) might not finish in time. This happens when the time between the start of the operation and the moment the data is needed is not sufficient to carry out the operation entirely, possibly while interleaved with other operations (e.g., a destage operation being interrupted multiple times by staging requests). If destaging is incomplete, it means that we could have released and saved more memory; if staging is incomplete, it means that operators might experience some I/O latency. However, given the priority of operations, the fact that fetching is done in the background, and the need for long-term sustainability of the workloads, we believe this to be an unlikely event in practice.
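This single-threaded prioritization can be sketched with a priority queue (illustrative Python skeleton; the real backend performs actual disk I/O):

```python
import heapq

# Lower number = higher priority: staging first, then late-event writes,
# then background destaging.
PRIORITY = {"stage": 0, "late_write": 1, "destage": 2}

class IOScheduler:
    """Serializes all persistent-storage operations through one queue."""
    def __init__(self):
        self._queue, self._seq = [], 0

    def submit(self, op, payload):
        # The sequence number keeps FIFO order within a priority class.
        heapq.heappush(self._queue, (PRIORITY[op], self._seq, op, payload))
        self._seq += 1

    def next_op(self):
        _, _, op, payload = heapq.heappop(self._queue)
        return op, payload
```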
Input iterator.
The input events that are accumulated in a window state are exposed to the application-specific processing functions through an iterator. In existing implementations, iterators are initialized in an eager way: the corresponding data structure object (e.g., a list) is allocated in memory with all its contents (some of which might not even be used by the window function). Since the initialization time can be high, especially if these contents are not in memory, an eager iterator might squander memory and CPU time. In contrast, AION uses lazy iteration: input events are retrieved from the p-bucket as they are requested. For example, when the iterator is called for the first time, it can issue a staging request to start staging events from the p-bucket, while at the same time it returns events from the m-bucket to the window operator.
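Lazy iteration can be sketched with a Python generator (illustrative; in the real system the second source is a staging request against persistent storage):

```python
def lazy_window_input(m_bucket, stage_from_p):
    """Lazily iterate over window input: serve in-memory events first,
    and only then pull the remaining events from the p-bucket."""
    yield from m_bucket          # immediately available in memory
    yield from stage_from_p()    # fetched from the p-bucket on demand
```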
Staging and serialization.
During destaging operations, a potentially large number of events needs to be transferred from memory to persistent storage through serialization. To speed up this CPU-intensive task, we use multiple threads that serialize blocks of events concurrently, writing out to disk in sequentially accessed files. The blocks are the basic unit inside the m-bucket. We also use multithreading for deserialization in staging operations.

In AION, we rely on JSON serialization, since it is not the bottleneck in our experiments and it allows us to better control the file partitioning. Using better-performing serialization schemes (e.g., Kryo, Protocol Buffers, Avro) would be straightforward, although orthogonal to our main goal. The serialization used can be easily changed in AION, e.g., to also compress data. Note that Flink itself already ensures that application data types are serializable.
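The block-parallel serialization step can be sketched as follows (illustrative Python using a thread pool; the real system writes the serialized blocks to sequentially accessed files on disk):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def serialize_blocks(events, block_size, workers=4):
    """Split events into fixed-size blocks (the basic unit inside the
    m-bucket) and serialize the blocks to JSON concurrently."""
    blocks = [events[i:i + block_size]
              for i in range(0, len(events), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(json.dumps, blocks))   # preserves block order
```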
Evaluation

In order to validate and demonstrate the effectiveness of AION, we conducted an experimental evaluation of our prototype. The main objective of the evaluation was to provide answers to the following questions:

Q1 Does AION handle memory pressure effectively by offloading state to disk when needed?
Q2 What is the overhead of AION, for the case where Flink is able to operate fully in-memory?
Q3 What are the benefits of each individual optimization?
Q4 Can AION comply with maximum staleness bounds while using resources efficiently?
Workloads.
Our experiments are based on two micro-benchmarks, average and bigrams , and two real-worldscenarios, stock market and
Linear Road Benchmark(LRB) . The micro-benchmarks correspond to a compu-tation dataflow that applies a single windowed functionover a data stream, calculating either the average of astream of randomly generated integers, or all bigramsfor a stream of real twitter posts (tweets). Both of thesecomputations are non-blocking, with bigrams having7uch higher time complexity (2-3 orders of magnitudehigher).The first benchmark application, stock market, imple-ments an official Flink example of a prototypical com-plex dataflow [8]. The application receives a (synthetic)data stream of stock market prices for different stocksymbols. Over this stream, it applies rolling aggrega-tions per stock (min, max, mean) in a sliding window(of 10 seconds every 5 seconds). Then, it uses a customtumbling windowed function to detect when the price ofa stock has suffered a variation of at least 5%, and emitthe corresponding stock symbols, as price warning alerts,to the next downstream operator. This operator, in turn,counts the price alerts per symbol in a tumbling win-dow. In a second substream, the application receives a(synthetic) stream of tweets with mentions of stock sym-bols, and counts the number of mentions per symbol in atumbling window. It finally joins the two substreams onkey symbol, and computes the correlation between thenumber of symbol mentions and the number of alerts persymbol using a custom function with a tumbling window.This application must handle late events so that decisionmakers can rely on accurate historical information.The second benchmark consists of a variable tollingsystem for a fictional expressway structure based on theLinear Road Benchmark (LRB) [12]. This system calcu-lates different toll rates for different segments of a road-way based on their levels of congestion. The data streamthat is fed as input to the dataflow is generated by theMIT-SIMLab (a simulation-based laboratory) [28] andconsists of vehicle position reports. 
The LRB dataflow can be summarized as follows. First, position reports are issued every 30 seconds by a transponder at each vehicle, identifying its exact location in the expressway system. These reports are used in two distinct substreams: 1) they are aggregated in a minute-long window to compute the number of vehicles and their average speed for every segment of every expressway; and 2) they are aggregated in a minute-long window with a custom function that detects the existence of accidents for every expressway segment. Subsequently, these two substreams are joined by key on segment; then, a custom function computes the corresponding toll based on the number of vehicles, their average speed, and the existence of accidents in a minute-long window. Position reports might suffer temporary network disconnection or arbitrary delays, and it is necessary to incorporate the effects of late events, e.g., to ensure the accuracy of the system and of the decisions that affect billing and incentives to redirect traffic.
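To make the price-warning stage of the stock market benchmark concrete, the following is a minimal, self-contained sketch of the 5% variation check. This is not the benchmark's actual code (which implements the check as a Flink tumbling windowed function); the function name, the per-window min/max reading of "variation", and the data layout are assumptions made for illustration.

```python
def price_alerts(window_events, threshold=0.05):
    """Emit the symbols whose price varied by at least `threshold`
    (5%) within one tumbling window, mimicking the price-warning
    stage of the stock market dataflow.

    window_events: iterable of (symbol, price) pairs belonging to
    a single window.
    """
    lo, hi = {}, {}
    for symbol, price in window_events:
        lo[symbol] = min(lo.get(symbol, price), price)
        hi[symbol] = max(hi.get(symbol, price), price)
    # relative variation between the window's extremes
    return [s for s in lo if (hi[s] - lo[s]) / lo[s] >= threshold]
```

In the full dataflow, the emitted symbols would feed the downstream operator that counts price alerts per symbol.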
Scenario        Max ingestion rate (events/s)   Window duration (s)   Payload size (bytes)
Average         10000                           20                    2304
Bigrams         5000                            30                    3584
Stock market    10000                           30                    1664
LRB             10000                           60                    1536
Table 1: Workload parameters

Event timestamps. For all the referred scenarios and windows, we assign timestamps when events are produced by the data generators. To do this, we read the current system clock and subtract a time value to make the timestamp fall either in the current window or in a past window, thereby simulating event delays:

ts = currentTime − windowIndex × windowDuration

Thus, the timestamp ts is given by the current time minus a certain number (windowIndex) of window lengths. To simulate a realistic delay, we set the windowIndex based on a log-normal distribution (mean and stddev are 0 and 1, respectively). Thus, the likelihood that a window receives an event decreases exponentially, as expected in most practical scenarios.

Setup. For all experiments, we compared the use of our backend, AION, with a baseline consisting of Flink's existing backend, whose implementation is only able to retain all window state in memory. Note that our gains and overheads come from the custom (i.e., user-defined) windowed functions; for other stateful operators, which do not rely on ListState, we perform similarly to the baseline. For our backend, we used the standard policy (see § ).

Setting. All tests were conducted using two machines with an Intel Core i7-2600K CPU at 3.40GHz, 11926 MB of RAM, and a 7200 RPM SATA 6 Gb/s HDD with a 32 MB cache, connected by a 1 Gbps LAN. One machine was used to run the data generators and the other to execute the streaming applications. This setting shows the benefits of AION on a per-node basis. The running environment consisted of Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-116-generic x86_64), Java HotSpot(TM) 1.8.0_77, and Flink 1.1.1. Our source code and the setup for the experiments are publicly available [1].
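The delay simulation described under "Event timestamps" can be sketched as follows. This is a simplified stand-alone version, not the benchmark generators themselves [1]; the function name and the truncation of the log-normal draw to an integer window index are illustrative assumptions.

```python
import random
import time

def late_timestamp(window_duration_s, mu=0.0, sigma=1.0):
    """Produce an event timestamp that falls windowIndex windows in
    the past: ts = currentTime - windowIndex * windowDuration.

    windowIndex is drawn from a log-normal distribution (mean 0,
    stddev 1) and truncated to an integer, so index 0 (an on-time
    event) is the most likely outcome and the probability of landing
    in an older window drops off quickly.
    """
    window_index = int(random.lognormvariate(mu, sigma))
    return time.time() - window_index * window_duration_s
```

For example, with a 20-second window (the average workload in Table 1), most calls return a timestamp inside the current window, while a shrinking fraction simulate events that are one, two, or more windows late.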
Q1. Does AION handle memory pressure effectively by offloading state to disk when needed?
Figure 2 shows the heap usage of our approach compared to the baseline. For the baseline system, as the number of past windows (i.e., the maximum allowed lateness of events) increases, the heap usage also increases, since more state data has to be maintained and accumulated in memory over time. The heap utilization of the baseline eventually becomes so large that the system crashes due to insufficient available memory. This happens after 7, 9, 5, and 8 past windows for average, bigrams, stock market, and LRB, respectively. In contrast, AION is able to maintain a stable and efficient memory utilization over time, roughly 3-4 GB as the median, regardless of the number of past windows, thus scaling window state for a potentially unbounded time frame. This capability comes from the fact that AION keeps only the state of active windows in memory; past window state is destaged and kept in persistent storage. Thanks to proactive caching, this comes without impacting the ingestion rate (as we will see next). Finally, AION offers significant savings in terms of median heap memory usage: it uses between 50% (bigrams) and 24% (stock market) less memory than the baseline.
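The memory/disk split just described can be illustrated with a small sketch. This is a toy model, not AION's actual backend (which extends Flink's ListState and adds proactive caching and predictive cleanup); the class, the pickle-based serialization, and the eager destaging policy are assumptions made purely for illustration.

```python
import os
import pickle
import tempfile

class WindowStateStore:
    """Toy model of the memory/disk split: state of windows inside
    the active interval stays on the heap, while older window state
    is destaged to files and staged back on demand."""

    def __init__(self, active_windows=1):
        self.active_windows = active_windows
        self.mem = {}   # window index -> in-memory state (event list)
        self.disk = {}  # window index -> path of destaged state
        self.dir = tempfile.mkdtemp()

    def append(self, window_index, event, current_window):
        state = self.fetch(window_index)
        state.append(event)
        self.mem[window_index] = state
        # destage every window that has left the active interval
        for idx in [i for i in self.mem
                    if i <= current_window - self.active_windows]:
            path = os.path.join(self.dir, "w%d.bin" % idx)
            with open(path, "wb") as f:
                pickle.dump(self.mem.pop(idx), f)
            self.disk[idx] = path

    def fetch(self, window_index):
        """Return the window's state, staging it from disk if needed."""
        if window_index in self.mem:
            return self.mem[window_index]
        if window_index in self.disk:
            with open(self.disk.pop(window_index), "rb") as f:
                return pickle.load(f)
        return []
```

In this toy model the fetch on a destaged window is synchronous; the point of AION's proactive caching, evaluated next, is precisely to hide that staging latency from the window computation.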
Q2. What is the overhead of AION for the case where Flink is able to operate fully in-memory?
We now measure the overhead of AION in terms of ingestion and processing rates. In Flink, the processing rate (events processed per second in a window) can affect the ingestion rate (events received per second), and therefore it is possible that, over time, the latter does not remain stable at its maximum (as shown in the experimental setup).

Ingestion rates. The ingestion rate measures the end-to-end throughput of the system, and is the most important metric for the performance of an SPS. A high ingestion rate shows that the system can keep up with its inputs. In particular, if the time it takes to process a window exceeds the window interval, which is the risk a system incurs when offloading window state to disk as AION does, then the ingestion rate drops. Our evaluation shows that AION has virtually no impact on the ingestion rate, thanks to its use of proactive caching.

Figure 3 shows, for each benchmark, the ingestion rate of normal (non-late) events only. We can observe that 1) with the exception of bigrams, there are no large variations across executions for different values of the number of past windows; and 2) the differences between AION and the baseline are relatively small. The higher variation in bigrams is linked to the fact that its input events (tweets) have a more variable size. Different input event sizes in bigrams, which is a computationally complex workload, cause different compute times over the execution timespan (note that processing time makes the ingestion of new events stall in Flink).

The results indicate that AION is on par with the baseline in terms of end-to-end performance: in the most favorable case, as the baseline starts thrashing and crashing, AION ingested up to 17% more events than the baseline with LRB; in the least favorable case, the baseline ingested up to 18% more events than AION with LRB. All other workloads show variances between 4% and 10%.

Figure 4 shows the ingestion rate of each benchmark for different lateness values, but this time also including late events. The ingestion rate decreases as the number of past windows increases in this case. This comes as no surprise: as a window ages, it is likely to receive fewer events, and therefore the overall ingestion rate tends to decrease as we extend the lateness timespan. Nevertheless, the ingestion rate values for AION get slightly closer to the baseline values: we go from a gain of 12% using AION (average) to a gain of 16% using the baseline (stock market). Other workloads exhibited a variation ranging from 1% to 9%. This happens because the number of normal and late events received over time decreases exponentially, which makes the differences smaller and more stable.
Processing rate. The processing rate gives a more low-level insight into the overhead of AION. The previous experiments show that, in all cases, the processing rate of AION is sufficient to keep up with the ingestion rate of the application. The following experiments show that with windowing functions of high computational complexity, AION has a similar processing rate to the baseline, since the cost of fetching data from disk can be amortized (thanks to proactive caching). For functions with low computational complexity, AION has a relatively lower processing rate, but this does not matter in absolute terms, since windows can nonetheless be computed quickly enough.

[Figure 2: Heap usage across workloads (average, bigrams, stock market, LRB). Boxes represent the median and the lower and upper quartiles (25%, 75%); whiskers show minimum and maximum values.]
[Figure 3: Ingestion rate of normal (non-late) events only.]
[Figure 4: Ingestion rate of normal and late events.]
[Figure 5: Processing rate of normal (non-late) events only.]
[Figure 6: GC collecting time (young and old generation) for average and bigrams.]

Figure 5 depicts, for each benchmark, the normal (non-late) event processing rate of AION and the baseline, when varying between 1 and 10 past windows. Several things can be observed. First, we can see that for average, bigrams, and LRB, the processing rate of AION is mostly stable as the number of past windows increases; in contrast, the processing rate of the baseline is mostly unstable for all the considered scenarios. Second, for average and bigrams, although the processing rate of the baseline starts by being higher than AION's, it follows a decreasing tendency as the number of past windows increases. This phenomenon occurs because the system starts thrashing: as the heap usage gets close to its limit, the JVM garbage collector is activated for longer intervals in the old generation (as shown in Figure 6a), which in practice steals CPU time from the applications. Moreover, average exhibits more accentuated differences between AION and the baseline (up to 31%). Such differences come as a result of AION having significantly more GC activity in the young generation than the baseline (as shown in Figure 6a). The increased activity is due to the additional backend data structures that we manage.

Although stock market generates a complex dataflow in terms of its streaming graph, the windowed functions themselves have a low time complexity: each window takes less than one second to process tens of thousands of events. As such, the fluctuation in the time for processing a single event is much higher. Nonetheless, because the computation time is so short, AION can still keep up with the ingestion rate, so this relative difference is not relevant in terms of end-to-end performance.

Finally, for LRB, the processing rate has less variance than stock market because the computation time is higher. When the baseline starts thrashing (after just 5 past windows), the first quartile of the processing rate drops drastically. This behavior results from alternating between high compute time (which includes GC time) and low ingestion rate: as the GC activity for the old generation increases, processing time increases, and the ingestion rate decreases; as such, for the next watermark, fewer events are expected, which makes GC activity and processing time decrease; in turn, this makes the ingestion rate increase, and the cycle repeats. Furthermore, the relative difference between AION and the baseline becomes significant because this workload has two memory-intensive custom functions, which results in higher GC activity in the young generation.

Figure 7 shows the processing rate when late events are included. Variance is generally reduced, especially for stock market and LRB. Similarly to what was described before for Figure 4, the number of events is greatly reduced as a window gets older, and this attenuates the differences between AION and the baseline over time.

To summarize, there are two main takeaways. First, AION's overheads are realistically low: the higher the complexity of the custom windowed functions, the closer AION's processing rate is to the baseline. Second, the processing rate only becomes relevant as an overhead when it makes the streaming application unsustainable across time. As long as the system is able to continuously provide results for every fixed time interval (sustainability condition), corresponding to the latency requirements defined through the window duration values, the processing time overhead can be disregarded.
Q3. What are the benefits of each individual optimization?
We now assess the impact (contribution and relevance) of the individual optimizations: pre-staging, multi-threaded serialization, and single-thread (sequential) I/O. We employ the average workload, since it is a simple pipeline with a single window and a low-complexity function (i.e., where optimization effects are more isolated and events need to be fetched more quickly). The optimizations are especially important to reduce the fetching time of a window operator when most of the state data resides in the p-bucket, which is the case with the standard policy when the allowed lateness time expires.

[Figure 7: Processing rate of normal and late events.]
[Figure 8: Effect of optimizations on normal and late events for the average workload. aion-full corresponds to the system fully optimized (with pre-staging, multi-threaded serialization, and a single I/O thread); no-pre-stgng is AION with pre-staging off; no-mt-srlz is AION with a single serialization thread; and no-sqntl-io is AION with multiple threads performing I/O operations simultaneously.]

Figure 8 shows the effect that each optimization has on the heap usage, ingestion rate, and processing rate of all events, when varying the number of past windows from 1 to 10. First, for no-pre-stgng, we can see that it uses less memory for average than aion-full (left sub-figure), which is natural since pre-staging loads state data in advance, and thus keeps memory occupied for a slightly longer time. However, no-pre-stgng performs significantly worse, by 2 orders of magnitude, in processing rate (bars close to zero in the right sub-figure), since it fully exposes the I/O latency by accessing the persistent storage (p-bucket) while the function is executing. As a consequence of the longer processing times with no-pre-stgng, the corresponding ingestion rate (central sub-figure) is also affected negatively: aion-full receives roughly 20% more events for average. We can thus conclude that pre-staging is a key feature in AION.

Second, we may observe that no-mt-srlz is not able to stabilize heap usage as the lateness time increases, ending up crashing after 8 past windows. This shows that a single thread for serialization is not sufficient to serialize data fast enough in destaging operations, thus leading to poorer memory savings. Similarly, one thread for deserialization is also not enough, since the processing rate falls as the number of past windows increases and more events have to be staged.

Finally, we can infer, for all three metrics, that the performance values are close between no-sqntl-io and aion-full, yet no-sqntl-io reveals a decreasing trend in processing rate as the lateness time increases. This trend results from the fact that as we keep more events from the past in memory, the more likely it is for destaging and staging operations to be incomplete at the time when the window execution starts. Staging operations, which impact the processing rate, should have higher priority for completion than destaging operations (i.e., memory savings are not as critical as complying with latency requirements), and that is what AION achieves with a single thread that prioritizes I/O operations.
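The interplay of multi-threaded serialization and single-threaded sequential I/O can be sketched roughly as follows. This is a simplified model and not AION's implementation: it omits the staging-over-destaging prioritization just described, and the function name and pickle-based serialization are assumptions for illustration.

```python
import pickle
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def destage(windows, out_files, serializer_threads=4):
    """Sketch of a destaging pipeline: several threads serialize
    window state in parallel, while one dedicated thread performs
    all disk writes sequentially (avoiding concurrent seeks)."""
    io_q = queue.Queue()

    def writer():
        # the single I/O thread: drain the queue until the sentinel
        while True:
            item = io_q.get()
            if item is None:
                break
            path, blob = item
            with open(path, "wb") as f:
                f.write(blob)

    w = threading.Thread(target=writer)
    w.start()
    with ThreadPoolExecutor(max_workers=serializer_threads) as pool:
        futures = {pool.submit(pickle.dumps, state): path
                   for path, state in zip(out_files, windows)}
        for fut in futures:
            io_q.put((futures[fut], fut.result()))
    io_q.put(None)  # sentinel: all windows serialized
    w.join()
```

The design choice mirrors the no-mt-srlz and no-sqntl-io ablations above: parallelizing only the CPU-bound serialization while serializing the disk writes keeps both the heap drain rate and the I/O pattern healthy.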
Q4. Can AION comply with maximum staleness bounds while using resources efficiently?
We now assess the effectiveness of our trigger described in § , i.e., whether it avoids using more than the necessary resources to comply with user-defined staleness limits.

[Figure 9: Maximum staleness, in logarithmic scale, for a varying number of executions under the log-normal distribution (left side); minimum number of executions necessary to reach a staleness bound that is 10%, 5%, and 1% of the maximum allowed lateness time, for different distributions (right side).]

For a log-normal distribution of late events, the left side of Figure 9 depicts the maximum staleness obtained across executions for different triggers. AION is our trigger, deltat corresponds to a punctuated trigger that executes periodically at every time interval, and deltaev is a trigger that executes every x events, where x is the total number of events expected divided by the number of executions. Our trigger achieves increasingly lower maximum staleness in relation to the standard triggers deltat and deltaev for the same number of executions. Moreover, the standard triggers take more executions than AION to reach the bounds of 0.1, 0.05, and 0.01 within 20 executions.

The right side of Figure 9 shows that our trigger is also effective for other distributions of late events. Apart from the log-normal (lnorm), we considered: (unif) a distribution that makes late events uniformly distributed across time; (norm) a normal distribution of events; and (bursts) a mix of normal distributions that generate bursts of late events. We show, for each distribution, the minimum number of executions needed to reach the considered bounds (0.1, 0.05, 0.01). The deltat trigger is as good as AION for unif, since it places the executions uniformly distributed in time, following the same trend of late event arrival. However, a uniform distribution is not realistic: the arrival of late events tends to have a more irregular behavior (due to temporarily disconnected devices, network delays, etc.). For the other distributions, AION reached all the bounds with fewer executions than the standard triggers. The major gain was for the log-normal distribution with a bound of 0.05, where AION performed only 31% and 27% of the executions of deltat and deltaev, respectively. Moreover, the standard triggers failed to reach the small bound of 0.01 for lnorm within 30 executions. This means that AION is able to comply with small staleness bounds with the minimum possible number of executions.
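For reference, the two baseline triggers can be expressed compactly. AION's own trigger is described in an earlier section and is not reproduced here; the helper names below are illustrative, not taken from the implementation. Given an execution budget, these helpers compute the firing points of deltat (time-based) and deltaev (count-based):

```python
def deltat_times(max_lateness_s, executions):
    """deltat baseline: fire periodically, at every fixed time
    interval across the allowed lateness period."""
    step = max_lateness_s / executions
    return [step * (i + 1) for i in range(executions)]

def deltaev_thresholds(expected_events, executions):
    """deltaev baseline: fire every x events, where x is the total
    number of expected late events divided by the number of
    executions."""
    x = max(1, expected_events // executions)
    return [x * (i + 1) for i in range(executions)]
```

For a 60-second lateness period and a budget of 4 executions, deltat fires at 15, 30, 45, and 60 seconds; for 100 expected late events, deltaev fires after every 25 events. Neither schedule adapts to the arrival distribution, which is why both lose to AION's trigger on skewed distributions such as lnorm.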
Stream processing has been researched for some time [25]. Despite its maturity, there has been a recent surge in interest, mainly due to the necessity of processing large amounts of data in real time [5]. SPSs that operate with clear event-time semantics and an emphasis on correctness have emerged only in the last few years [10, 21]. Even more recently, modern SPSs started acknowledging the need to deal with lateness, such as Google Cloud Dataflow [10] and its predecessor MillWheel [9], which refer to the difficulty of picking a maximum allowed lateness, yet always leave the task to the developer.

State spilling has been proposed to handle memory-overloaded operators by transferring parts of the state from memory to disk [19]. This state spilling is limited to non-window operators, despite the authors acknowledging that tackling window constraints would require interleaving in-memory execution with disk management, and would bring a new set of challenges, such as the timing of spills, the timing of clean-up, and the selection of data to clean up. Our work addresses these challenges, which have remained unresolved until now [26]. In particular, we offer a comprehensive solution to the problem of lateness, where we go beyond a solution that simply spills data to disk naively. We manage state across memory and disk with proactive caching, avoiding a processing rate penalty due to I/O overhead, and predictive cleanup, releasing resources when they are estimated to no longer be needed. Also, we offer a trigger that minimizes staleness while using resources efficiently.

When broadening the scope of the comparison to other types of systems, a few have addressed the issue of handling late events. One such example is Photon [11], a system deployed at Google for joining the click-stream with ads, which provides exactly-once semantics on unordered streams, coupled with robust fault tolerance. The design of Photon is quite different from the stream processing engines we are considering (e.g., it uses Paxos), since it is a specific solution developed for a few critical applications. In particular, it is not clear how its solutions would apply to existing distributed stream processing systems.

Another example is Samza [22], a stream processing system created at LinkedIn. Without scaling horizontally with more containers, Samza acknowledges that disk spilling is necessary in order to scale to large state; however, the authors refer to this as a problem orthogonal to their approach and do not provide a concrete solution.

Li et al. [18] argue that setting an appropriate maximum lateness (referred to as slack) is extremely difficult in practice. Therefore, they propose out-of-order processing, together with stream punctuation for watermarking, as a solution. However, the design and implementation of the watermarking scheme are not discussed in detail, and late events are never considered. In contrast, our proposal presents the design and implementation of a complete solution in the context of real-world, non-ideal watermarks and late events.

Finally, fault tolerance and checkpointing are related but orthogonal topics: tolerating machine failures may be done by storing state in a persistent medium; however, the solutions that are used for tolerating faults do not necessarily apply to the problem tackled in this paper (e.g., such solutions do not involve offloading state from memory). This is the case, in particular, for the solution used by Flink [13, 14].
This paper presented AION, a comprehensive solution to deal with late events, tailored to memory-intensive, long-lived windows with potentially large periods of tolerated lateness. First, AION offloads window state from memory to disk and recovers it through proactive caching at strategic times. Second, AION estimates the best maximum allowed lateness based on the continuous observation of the distribution of late events over time (predictive cleanup). Finally, AION provides a customized trigger for past windows that is able to determine the execution times that minimize result staleness with the minimum number of executions (necessary to comply with user-specified staleness bounds).

Experimental evaluation indicates that AION is capable of maintaining sustainable levels of memory utilization while still preserving high throughput, low latency, and low staleness.
References

[1] Aion benchmarks. https://github.com/sesteves/aion-benchmarks. Accessed: Feb 2018.
[2] Apache Apex. http://apex.apache.org/. Accessed: Feb 2018.
[3] Apache Samza. http://samza.apache.org/. Accessed: Feb 2018.
[4] Apache Spark. http://spark.apache.org/. Accessed: Feb 2018.
[5] Apache Storm. http://storm.apache.org/. Accessed: Feb 2018.
[6] Flink 1.1.1 with AION backend. https://github.com/sesteves/flink. Accessed: Feb 2018.
[7] Google Cloud Dataflow. https://cloud.google.com/dataflow/. Accessed: Feb 2018.
[8] Stock market example. https://flink.apache.org/news/2015/02/09/streaming-example.html. Accessed: Oct 2017.
[9] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 6(11):1033–1044, 2013.
[10] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The Dataflow Model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8:1792–1803, 2015.
[11] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 577–588, New York, NY, USA, 2013. ACM.
[12] A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear Road: A stream data management benchmark. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 480–491. VLDB Endowment, 2004.
[13] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas. State management in Apache Flink: Consistent stateful distributed stream processing. Proceedings of the VLDB Endowment, 10(12):1718–1729, Aug. 2017.
[14] P. Carbone, G. Fóra, S. Ewen, S. Haridi, and K. Tzoumas. Lightweight asynchronous snapshots for distributed dataflows. arXiv preprint arXiv:1506.08603, 2015.
[15] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull., 38(4):28–38, 2015.
[16] D. R. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems (TOPLAS), 7(3):404–425, 1985.
[17] J. Li, D. Maier, K. Tufte, V. Papadimos, and P. A. Tucker. Semantics and evaluation techniques for window aggregates in data streams. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 311–322. ACM, 2005.
[18] J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, and D. Maier. Out-of-order processing: A new architecture for high-performance stream systems. Proceedings of the VLDB Endowment, 1(1):274–288, 2008.
[19] B. Liu, Y. Zhu, and E. Rundensteiner. Run-time operator state spilling for memory intensive long-running queries. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 347–358, New York, NY, USA, 2006. ACM.
[20] M. R. N. Mendes, P. Bizarro, and P. Marques. A performance study of event processing systems. In R. Nambiar and M. Poess, editors, Performance Evaluation and Benchmarking, pages 221–236, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
[21] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013.
[22] S. A. Noghabi, K. Paramasivam, Y. Pan, N. Ramesh, J. Bringhurst, I. Gupta, and R. H. Campbell. Samza: Stateful scalable stream processing at LinkedIn. Proceedings of the VLDB Endowment, 10(12):1634–1645, Aug. 2017.
[23] S. Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.
[24] U. Srivastava and J. Widom. Flexible time management in data stream systems. In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '04, pages 263–274, New York, NY, USA, 2004. ACM.
[25] R. Stephens. A survey of stream processing. Acta Informatica, 34(7):491–541, 1997.
[26] Q.-C. To, J. Soto, and V. Markl. A survey of state management in big data processing systems. arXiv preprint arXiv:1702.01596, 2017.
[27] Y.-C. Tu, S. Liu, S. Prabhakar, and B. Yao. Load shedding in stream databases: A control-based approach. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB '06, pages 787–798. VLDB Endowment, 2006.
[28] Q. Yang and H. N. Koutsopoulos. A microscopic traffic simulator for evaluation of dynamic traffic management systems. Transportation Research Part C: Emerging Technologies, 4(3):113–129, 1996.