A Synopses Data Engine for Interactive Extreme-Scale Analytics
(c) Owner 2019. This is the authors' version of the work. It is posted here for your personal use only. Not for redistribution.

Antonis Kontaxakis
Athena Research Center, Technical University of Crete
[email protected], @softnet.tuc.gr
Nikos Giatrakos
Athena Research Center, Technical University of Crete
[email protected], @softnet.tuc.gr
Antonios Deligiannakis
Athena Research Center, Technical University of Crete
[email protected], @softnet.tuc.gr
ABSTRACT
In this work, we detail the design and structure of a Synopses Data Engine (SDE) which combines the virtues of parallel processing and stream summarization towards delivering interactive analytics at extreme scale. Our SDE is built on top of Apache Flink and implements a synopsis-as-a-service paradigm. In that, it achieves (a) concurrently maintaining thousands of synopses of various types for thousands of streams on demand, (b) reusing maintained synopses among various concurrent workflows, (c) providing data summarization facilities even for cross-(Big Data) platform workflows, (d) pluggability of new synopses on-the-fly, and (e) increased potential for workflow execution optimization. The proposed SDE is useful for interactive analytics at extreme scales because it enables (i) enhanced horizontal scalability, i.e., not only scaling out the computation to a number of processing units available in a computer cluster, but also harnessing the processing load assigned to each by operating on carefully-crafted data summaries, (ii) vertical scalability, i.e., scaling the computation to very high numbers of processed streams, and (iii) federated scalability, i.e., scaling the computation beyond single clusters and clouds by controlling the communication required to answer global queries posed over a number of potentially geo-dispersed clusters.
ACM Reference Format:
Antonis Kontaxakis, Nikos Giatrakos, and Antonios Deligiannakis. 2020. A Synopses Data Engine for Interactive Extreme-Scale Analytics. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

∗ This work has received funding from the EU Horizon 2020 research and innovation program INFORE under grant agreement No 825070.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, July 2017, Washington, DC, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION

Interactive extreme-scale analytics over massive, high-speed data streams has become of the essence in a wide variety of modern application scenarios. In the financial domain, NYSE alone generates several terabytes of data a day, including trades of thousands of stocks [6]. Stakeholders such as authorities and investors need to analyze these data in an interactive, online fashion for timely market surveillance or investment risk/opportunity identification purposes. In the life sciences domain, studying the effect of applying combinations of drugs on simulated tumors of realistic sizes can generate cell state data of 100 GB/min [25], which need to be analyzed online to interactively determine successive drug combinations. In maritime surveillance applications, one needs to fuse high-velocity position data streams of hundreds of thousands of vessels across the globe and satellite, aerial images [31] of various resolutions. In all these scenarios, data volumes and rates are only expected to rise in the near future. In the financial domain, data from emerging markets, such as crypto-currencies, are increasingly added to existing data sources. In life sciences, simulations are becoming progressively more complex, involving billions of interacting cells, while in the maritime domain autonomous vehicles are added as on-site sensing information sources.

To enable interactive analytics at extreme scale, stream processing platforms and systems need to provide three types of scalability:

• Horizontal scalability, i.e., the ability to scale the computation with extreme data volumes and data arrival rates as analyzed in the aforementioned scenarios. This requires scaling out the computation to a number of machines and respective processing units available at a corporate data center (cluster) or cloud. Horizontal scalability is achieved by parallelizing the processing and adaptively assigning computing resources to running analytics queries.
• Vertical scalability, i.e., the ability to scale the computation with the number of processed streams. For instance, detecting systemic risks in the financial scenario, i.e., stock-level events that could trigger instability or collapse of an entire industry or economy, requires discovering and interactively digging into correlations among tens of thousands of stock streams. The problem involves identifying the highly correlated pairs of stock data streams under various statistical measures, such as Pearson's correlation, over N distinct, high-speed data streams, where N is a very large number. Tracking the full Θ(N²) correlation matrix results in a quadratic explosion in space and computational complexity, which is simply infeasible for very large N. The problem is further exacerbated when considering higher-order statistics (e.g., conditional dependencies/correlations). The same issue arises in the maritime surveillance scenario for trajectory similarity scores over hundreds of thousands of vessels. Clearly, techniques that can provide vertical scaling are sorely needed for such scenarios.
• Federated scalability, i.e., the ability to scale the computation in settings where data arrive at multiple, potentially geographically dispersed sites. On the one hand, a number of benchmarks [29, 33] conclude that, in such settings, even if horizontal scalability is ensured within each cluster, the maximum achieved throughput (number of streaming tuples that are processed per time unit) is network bound. On the other hand, consider again the systemic risk detection scenario from the financial domain, where stock trade data arrive at geo-dispersed data centers around the globe.
Moving entire data streams around the sites in order to extract pairwise correlation scores depletes the available bandwidth, introducing network latencies that prevent the interactivity of the desired analytics.

Big Data platforms, including Apache Flink [2], Spark [4] and Storm [5] among others, have been developed that support or are especially dedicated to stream processing. Such platforms focus on horizontal scalability, but they are not sufficient by themselves to allow for the required vertical and federated scalability. On the other hand, there is a wide consensus in stream processing [17, 19–21, 23, 30, 34] that approximate but rapid answers to analytics tasks, more often than not, suffice. For instance, knowing in real-time that a group of approximately 50 stocks, extracted out of thousands or millions of stock combinations, is highly correlated often suffices for timely decision making.

In this work, we detail the design and structure of a Synopses Data Engine (SDE) built on top of Apache Flink, ingesting streams via Apache Kafka [3]. Our SDE combines the virtues of parallel processing and stream summarization towards delivering interactive analytics at extreme scale by enabling enhanced horizontal, vertical and federated scalability as described above. However, the proposed SDE goes beyond that. Our design implements a Synopsis-as-a-Service (termed SDEaaS) paradigm where the SDE can serve multiple, concurrent application workflows in which each maintained synopsis can be used as an operator. That is, our SDE operates as a single, constantly running Flink job which achieves:

A. concurrently maintaining thousands of synopses for thousands of streams on demand,
B. reusing maintained synopses among multiple application workflows (submitted jobs) instead of redefining and duplicating streams for each distinct workflow separately,
C. pluggability of new synopses' definitions on-the-fly,
D. providing data summarization facilities even for cross-(Big Data) platform workflows [28] outside of Flink,
E.
optimization of workflow execution by enabling clever data partitioning,
F. advanced optimization capabilities to minimize workflow execution times by replacing exact operators (aggregations, joins etc.) with approximate ones, given a query accuracy budget to be spent.

Few prior efforts provide libraries for online synopses maintenance, but neglect parallelization aspects [8, 9], or lack a SDEaaS design [7], needing to run a separate job for each maintained synopsis. The latter compromises aspects in points A-F above and increases cluster scheduling complexity. Others [32] lack architectural provisions for federated scalability and are limited to serving simple aggregation operators, being deprived of vertical scalability features as well. On the contrary, our proposed SDE not only includes provisions for federated scalability and provides a rich library of synopses to be loaded and maintained on the fly, but also allows to plug in external, new synopsis definitions, customizing the SDE to application field needs. More precisely, our contributions are:

(1) We present the novel architecture of a Synopses Data Engine (SDE) capable of providing interactivity in extreme-scale analytics by enabling various types of scalability.
(2) Our SDE is built using a SDE-as-a-Service (SDEaaS) paradigm; it can efficiently maintain thousands of synopses for thousands of streams to serve multiple, concurrent, even cross-(Big Data) platform, workflows.
(3) We describe the structure and contents of our SDE Library, the implemented arsenal of data summarization techniques for the proposed SDE, which is easily extensible by exploiting inheritance and polymorphism.
(4) We discuss insights we gained while materializing a SDEaaS paradigm and outline lessons learned, useful for future, similar endeavors.
(5) We showcase how the proposed SDE can be used in workflows to serve a variety of purposes towards achieving interactive data analytics.
(6) We present a detailed experimental analysis using real data from the financial domain to prove the ability of our approach to scale to extreme volumes, high numbers of streams and degrees of geo-distribution, compared to other candidate approaches.
2 RELATED WORK

From a research viewpoint, there is a large number of related works on data synopsis techniques. Such prominent techniques, cited in Table 1, are already incorporated in our SDE and some of them are further discussed in practical examples in Section 7. Please refer to [17, 20, 23] for comprehensive views on relevant issues.

Yahoo! DataSketch [9] and Stream-lib [8] are software libraries of stochastic streaming algorithms and summarization techniques, correspondingly. These libraries are detached from parallelization and distributed execution aspects, contrary to the SDE we propose in this work. Apache Spark [4] provides utilities for data synopsis via sampling operators, CountMin sketches and Bloom Filters. Similarly, Proteus [7] extends Flink with data summarization utilities. Spark and Proteus combine the potential of data summarization with parallel processing over Big Data platforms by providing libraries of data synopsis techniques. Compared to these, first, we provide a richer library of data summarization techniques (Table 1) which covers all types of scalability mentioned in Section 1. Second, we propose a novel architecture for implementing a SDEaaS paradigm which allows synopses to be loaded from internal libraries or get plugged from external libraries on-the-fly, as the service is up and running. Third, as we detail in Section 6, our SDEaaS paradigm and architecture enable the simultaneous maintenance of thousands of synopses of different types for thousands of streams, which (i) might not even be possible without our SDEaaS paradigm, (ii) allows various running workflows to share/reuse currently maintained synopses and thus prevents duplicating the same data and synopses for each workflow, and (iii) reduces the load of cluster managers compared to accomplishing the same task while lacking our SDEaaS design. Finally, SnappyData's [32] stream processing is based on Spark. SnappyData's SDE is limited to serving simple
SUM, COUNT and AVG queries (https://snappydatainc.github.io/snappydata/aqp/), being deprived of vertical scalability features and federated scalability architectural provisions.

3 SDE API

In this section, we outline the functionality that our SDE API provides to upstream (i.e., contributing input to) and downstream (receiving input from) operators and application interfaces of a given Big Data processing pipeline engaging synopses. All requests are submitted to the SDE at runtime, given the SDEaaS nature of our design, via lightweight, properly formatted JSON snippets [16] to ensure cross-(Big Data) platform compatibility. The JSON snippet of each request listed below includes a unique identifier for the queried stream or source incorporating multiple streams (see Section 4) and a unique id for the synopsis to be loaded/created/queried. In case of a create or load synopsis request (see below and Table 1), the parameters of the synopsis as well as a pair of parameters involving the employed parallelization degree and scheme (see Section 4) are also included in the JSON snippet. In federated architectures, where multiple, geo-dispersed clusters run local SDEaaS instances and estimations provided by synopses need to be collected at a cluster afterwards, the JSON snippet also includes the address of that cluster. The SDE API provides the following facilities:
Build/Stop Synopsis (Request). A synopsis can be created or ceased on-the-fly, as the SDE is up and running. In that, the execution of other running workflows that utilize synopsis operators is not hindered. A synopsis may be (a) a single-stream synopsis, i.e., a synopsis (e.g., a sample) maintained on the trades of a single stock, or (b) a data source synopsis, i.e., a synopsis maintained on all trades irrespectively of the stock. Moreover, Build/Stop Synopsis allows submitting a single request for maintaining a synopsis of the same kind for each out of multiple streams coming from a certain source. For instance, maintaining a sample per stock for thousands of stocks coming from the same source requires the submission of a single request. A condensed view of a JSON snippet for building a new synopsis is illustrated in Figure 1.
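Since the actual snippet of Figure 1 is only sketched in this version, the following Python fragment assembles a hypothetical BuildSynopsis request; every field name (requestID, streamID, synopsisID, parallelism, and so on) is an illustrative assumption rather than the SDE's actual JSON schema.

```python
import json

# Hypothetical BuildSynopsis request; all field names are illustrative
# assumptions, not the SDE's actual JSON schema.
request = {
    "requestID": 1,
    "action": "BuildSynopsis",
    "streamID": "AAPL",                      # stream or data-source identifier
    "synopsisID": "countmin-aapl",           # unique id of the synopsis
    "synopsis": "CountMin",                  # synopsis type, as in Table 1
    "parameters": {"epsilon": 0.01, "delta": 0.01},
    "parallelism": 4,                        # employed parallelization degree
    "scheme": "single-stream",               # or "data-source"
}
snippet = json.dumps(request)  # the lightweight snippet written to Kafka
```

A Stop request would presumably carry the same identifiers with a different action field.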
Load Synopsis (Request). The SDE Library (Section 5) incorporates a number of synopsis operators commonly used in practical scenarios. Load Synopsis supports pluggability of the code of additional (not included in the SDE Library) synopses, and their dynamic loading and maintenance at runtime. The structure of the SDE Library, utilizing inheritance and polymorphism, is key for this task. This is an important feature since it enables customizing the SDE to application-specific synopses without stopping the service.
Ad-hoc Query (Request). The SDE accepts one-shot, ad-hoc queries on a certain synopsis and provides respective estimations (approximate answers) to downstream operators or application interfaces, based on its current status.
Continuous Querying. Continuous queries can be defined together with a Build/Stop Synopsis request. In this case, an estimation of the approximated quantities, such as counts, frequency moments or correlations, is provided every time the estimation of the synopsis is updated, for instance, due to the reception of a new tuple.

The response to ad-hoc or continuous queries is also provided in lightweight JSON snippets including: (i) a key, value pair for uniquely identifying the provided response (e.g., from past and future ones) and for the value of the estimated quantity, respectively, (ii) the id of the request that generated the response, (iii) the identifier of the utilized synopsis along with its parameters (Table 1).
SDE Status Report. The API allows querying the SDE about its status, returning information about the currently maintained synopses and their parameters. This facility is useful during the definition of new workflows, since it allows each application to discover whether it can utilize already maintained data synopses and reuse synopses serving multiple workflows.
4 SDE ARCHITECTURE

In this section we detail the SDE architectural components and present their utility in serving the operations specified in Section 3.

4.1 Basic Components
Our architecture is built on top of Apache Flink [2] and Kafka [3]. Kafka is used as a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system enabling connectivity between the SDE and upstream, downstream operators in the workflows served by the SDE. Kafka, together with the JSON format of accepted request snippets, allows us to materialize the SDEaaS paradigm even when upstream or downstream operators run on different Big Data platforms. Furthermore, it is used as a messaging service in case of querying synopses maintained at a number of geo-dispersed clusters. A Kafka cluster is composed of a number of brokers, run in parallel, that handle separate partitions of topics. Topics constitute categories of data where producers and consumers can write and read, respectively. In the case of the SDE, producers constitute upstream operators, while downstream operators act as consumers. Furthermore, in geo-dispersed, multi-cluster settings, the SDE instances run at each cluster may be the producers or consumers of a particular Kafka topic, as will be explained later on in this section.

A Flink cluster is composed of (at least one) Master and a number of Worker nodes. The Master node runs a JobManager for distributed execution and coordination purposes, while each Worker node incorporates a TaskManager which undertakes the physical execution of tasks. Each Worker (JVM process) has a number of task slots (at least one). Each Flink operator may run in a number of instances, executing the same code, but on different data partitions. Each such instance of a Flink operator is assigned to a slot, and tasks of the same slot have access to isolated memory shared only among tasks of that slot. Figure 2 provides a condensed view of the SDE architecture, which engages Map, FlatMap, CoFlatMap, Union and Split Flink operators. In a nutshell, a Map operator takes one tuple and produces another tuple in the output; a FlatMap operator takes one tuple and produces zero, one, or more tuples; a CoFlatMap operator hosts two FlatMaps that share access to common variables (therefore the linking icon in the figure) among streams that have previously been connected (using a Connect operator in Flink). Finally, a Union operator receives two or more streams and creates a new one containing all their elements, while a Split operator splits the stream into two or more streams according to some criterion. Section 4.2 explains the reason for the above design and the flow of information in different uses of the SDE.
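As a platform-neutral sketch of these transformations (pure Python over lists standing in for streams; the shared state of CoFlatMap is omitted), one could write:

```python
def map_op(stream, f):
    # Map: exactly one output tuple per input tuple
    return [f(t) for t in stream]

def flat_map_op(stream, f):
    # FlatMap: zero, one, or more output tuples per input tuple
    return [out for t in stream for out in f(t)]

def union_op(*streams):
    # Union: a single stream containing all elements of the inputs
    return [t for s in streams for t in s]

def split_op(stream, predicate):
    # Split: route each tuple to one of two output streams by a criterion
    return ([t for t in stream if predicate(t)],
            [t for t in stream if not predicate(t)])
```

For example, `flat_map_op([1, 2], lambda x: [x] * x)` yields `[1, 2, 2]`.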
4.2 Information Flow in the SDE

Employed Parallelization Scheme(s). The parallelization scheme that is employed in the design of the SDE is partition-based parallelization [24]. That is, every data tuple that streams into the SDE architecture and is destined to be included in a maintained synopsis does so based on the partition key assigned to it. When a synopsis is maintained for a particular stream (i.e., per stock; see Section 3), the key that is assigned to the respective update (newly arrived data tuple) is the identifier of that particular stream for which the synopsis is maintained. In this case, within the distributed computation framework of Flink, that stream is processed by a task of the same worker, and parallelization is achieved by distributing the number of streams for which a synopsis is built to the available workers in the cluster hosting the SDE. On the other hand, when a synopsis involves a data source (i.e., a financial data source for all monitored stock streams; see Section 3), the desired degree of parallelism is included as a parameter in the respective request to build/start maintaining the synopsis. In the latter case, one dataset is partitioned to the available workers in a round-robin fashion and the respective keys are created by the SDE (details on that follow shortly), each of which points (is hashed) to a particular worker. Finally, in case of processing streaming windows (either tuple- or count-based) [24], an incoming tuple may (i) initiate a new window, (ii) be assigned to one or more existing windows or (iii) terminate a window. Here, the partition is the window itself and the tuple is given the key(s) of the window(s) it affects.
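The key-assignment logic above can be sketched as follows; the helper name, the tuple layout and the round-robin counter are assumptions made for illustration:

```python
import itertools

NUM_WORKERS = 4  # desired parallelism, as requested for a data-source synopsis
_round_robin = itertools.cycle(range(NUM_WORKERS))

def partition_key(tuple_, scheme):
    """Return the partition key that routes a tuple to a worker."""
    if scheme == "single-stream":
        # Per-stream synopsis: the stream identifier itself is the key,
        # so every update of that stream reaches the same worker.
        return tuple_["streamID"]
    if scheme == "data-source":
        # Data-source synopsis: tuples are spread round-robin and an
        # SDE-created worker identifier serves as the key.
        return f"worker-{next(_round_robin)}"
    raise ValueError(f"unknown scheme: {scheme}")
```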
Data and Query Ingestion. Data and request (JSON snippet) streams arrive at a particular Kafka topic each. In the case of the DataTopic of Figure 2, a parser component is used in order to extract the key and value field(s) on which a currently running synopsis is maintained. The respective parser of the RequestTopic of Figure 2 reads the JSON snippet of the request and processes it. When an incoming request involves the maintenance of a new synopsis, the parser component extracts information about the parameters of the synopsis (see Table 1) and its nature, i.e., whether it is on a single stream or on a data source, whether it involves a multi-stream synopsis maintenance request or a synopsis that is also maintained in SDE instances in other geo-dispersed clusters. In case the request is an ad-hoc query, the parser component extracts the corresponding synopsis identifier(s).
Requesting New Synopsis Maintenance. When a request is issued for maintaining a new synopsis, it initially follows the red-colored paths of the SDE architecture in Figure 2. That is, the corresponding parser sends the request to a FlatMap operator (termed RegisterRequest at the bottom of Figure 2) and to another FlatMap operator (RegisterSynopsis) which is part of a CoFlatMap one. RegisterRequest and RegisterSynopsis produce the keys (as analyzed in the description of the supported parallelization schemes) for the maintained synopsis, but provide different functionality. The RegisterRequest operator uses these keys in order to later decide which worker(s) an ad-hoc query, which also follows the red-colored path, as explained shortly, should reach. On the other hand, the RegisterSynopsis operator uses the same keys to decide to which worker(s) a data tuple destined to update one or more synopses, which follows the blue-colored path in Figure 2, should be directed. The possible parallelization degree of the RegisterSynopsis and RegisterRequest operators up to this point of the architecture depends on the number of running synopses.
Updating the Synopsis. When a data tuple destined to update one or more synopses is ingested via the DataTopic of Kafka, it follows the blue-colored path of the SDE architecture in Figure 2.
Figure 1: JSON snippet for BuildSynopsis request.

Figure 2: SDE Architecture – Condensed View.
The tuple is directed to the HashData FlatMap of the corresponding CoFlatMap, where the keys (stream identifier for a single-stream synopsis and/or worker identifier for a data source synopsis and windowing operations) are looked up based on what RegisterSynopsis has created. Following the blue-colored path, the tuple is then directed to an add FlatMap operator which is part of another CoFlatMap. The add operator updates the maintained synopsis as prescribed by the algorithm of the corresponding technique. For instance, in case an FM sketch [22] is maintained, the add operation hashes the incoming tuple to a position of the maintained bitmap and turns the corresponding bit to 1 if it is not already set.
Ad-hoc Query Answering. An ad-hoc query arrives via the RequestTopic of Kafka and is directed to the RegisterRequest operator. The operator, which has produced the keys using the same code as RegisterSynopsis does, looks up the key(s) of the queried synopsis and directs the corresponding request to the estimate FlatMap operator of the corresponding CoFlatMap. The estimate operator reads via the shared state the current status of the maintained synopsis and extracts the estimation of the corresponding quantity the synopsis is destined to provide. For instance, upon performing an ad-hoc query on an FM sketch [22], the estimate operator reads the maintained bitmap, finds the lowest position of the unset bit and provides a distinct count estimation by using the index of that position and a ϕ ≈ 0.77 coefficient. Table 1 summarizes the estimated quantities each of the currently supported synopses can provide.
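To make the add/estimate pair concrete, here is a minimal, self-contained Python sketch of an FM bitmap behind a Synopsis-style add/estimate/merge interface in the spirit of the SDE Library (an illustrative re-implementation of the classic algorithm, not the engine's actual code; ϕ = 0.77351 is the standard Flajolet-Martin correction constant):

```python
import hashlib
from abc import ABC, abstractmethod

PHI = 0.77351  # classic Flajolet-Martin correction constant

class Synopsis(ABC):
    """Illustrative base contract: concrete synopses override these three."""
    @abstractmethod
    def add(self, item): ...
    @abstractmethod
    def estimate(self): ...
    @abstractmethod
    def merge(self, other): ...

class FMSketch(Synopsis):
    def __init__(self, nbits=32):
        self.nbits = nbits
        self.bitmap = 0

    def _rho(self, item):
        # position of the least-significant set bit of the item's hash
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        r = 0
        while h & 1 == 0 and r < self.nbits - 1:
            h >>= 1
            r += 1
        return r

    def add(self, item):
        # turn on the bit at position rho(item), if it is not already set
        self.bitmap |= 1 << self._rho(item)

    def estimate(self):
        # R = index of the lowest unset bit; distinct-count estimate = 2^R / phi
        r = 0
        while self.bitmap & (1 << r):
            r += 1
        return (2 ** r) / PHI

    def merge(self, other):
        merged = FMSketch(self.nbits)
        merged.bitmap = self.bitmap | other.bitmap  # logical disjunction
        return merged
```

Note that re-adding an already seen item leaves the bitmap, and hence the estimate, unchanged.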
Continuous Query Answering. In case continuous queries are to be executed on the maintained synopses, a new estimation needs to be provided every time the estimation of the synopsis is updated, either via an add operation or because a window on the data expires. In this particular occasion, estimate needs to be invoked by add. Both in ad-hoc and continuous querying, the result of estimate, following the red path in Figure 2, is directed to a Split operator, termed splitter. If necessary, the splitter forwards estimations to a Union operator, termed federator, which reads from a Union Kafka topic (yellow path in Figure 2). The Union Kafka topic and the federator involve our provisions for maintaining federated synopses, i.e., synopses that are kept at a number of potentially geo-dispersed clusters. The splitter distinguishes between three cases.
Case 1: This case happens when estimate involves a single-stream synopsis maintained only locally at a cluster. Then, Split directs the output to downstream operators of the executed workflow via Kafka, by following the green-colored path in Figure 2.
Case 2: This case arises when a federated synopsis is queried, but the request has identified another cluster responsible for extracting the overall estimation. Then, Split acts as the producer (writes) to the geo-dispersed Union Kafka topic of another cluster (declared by the dotted, yellow arrow coming out of splitter in Figure 2). Let us now see
Case 3: For non-federated synopses defined on entire data sources (e.g., a sample over all stock data), a number of workers of the current cluster participate in the employed parallelization scheme, as discussed at the beginning of Section 4.2. Thus, each such worker provides its local synopsis/estimation. Because something similar holds when individual clusters maintain federated synopses and the current cluster is set as responsible for synthesizing the overall estimation, in both cases the output of the Split operator is directed via Union to a merge FlatMap following the purple-colored path. The merge operator merges the partial results of the various workers and/or clusters and produces the final estimation, which is streamed to downstream operators, again via an Output Kafka topic. For instance, FM sketches [22] or Bloom Filters [14] (bitmaps) can be merged via simple logical disjunctions or conjunctions. At this point, in order to direct all partial estimates to the same worker of a cluster to perform the merge operation, a corresponding identifier for the issued request (for ad-hoc queries) or an identifier for the maintained synopsis (for continuous queries) is used as the key.
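As a toy illustration of this merge step, the snippet below groups partial FM/Bloom bitmaps by the request identifier used as the key and combines them with a logical disjunction (the identifiers and bit patterns are made up):

```python
from collections import defaultdict

# Partial bitmaps emitted by workers/clusters, tagged with the request id
# that serves as the routing key; ids and bit patterns are illustrative.
partials = [
    ("req-42", 0b0111),
    ("req-42", 0b0011),
    ("req-42", 0b1001),
    ("req-43", 0b0001),
]

merged = defaultdict(int)
for request_id, bitmap in partials:
    merged[request_id] |= bitmap  # disjunction merges FM/Bloom bitmaps
```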
5 THE SDE LIBRARY

The internal structure of the synopses library is illustrated in Figure 3, which provides only a partial view of the currently supported synopses for readability purposes. Table 1 provides a full list of currently supported synopses, their utility in terms of estimated quantities, and their parameters. The development of the SDE Library exploits subtype polymorphism in Java in order to ensure the desired level of pluggability for new synopses definitions.
Synopsis | Estimation | Parameters
CountMin [19] | Count/Frequency Estimation | ϵ, δ
BloomFilter [14] | Set Membership | ϵ, δ
HyperLogLog [21] | Distinct Count | Relative Standard Error
AMS Sketch [12] | L2-Norm, Inner Product | ϵ, δ
Discrete Fourier Transform (DFT) [34] | Correlation, BucketID | Similarity Threshold, Number of Coefficients
Random Hyperplane Projection (RHP) [15, 26] | Correlation, BucketID | Bitmap Size, Similarity Threshold, Number of Buckets
Lossy Counting [30] | Count, Frequent Items | ϵ
Sticky Sampling [30] | Count, Frequent Items | support, ϵ, δ
Chain Sampler [13] | Sample | Sample Size
GKQuantiles [27] | Quantiles | ϵ
CoreSetTree [10] | CoreSets | Bucket size, dimensionality
Table 1: Supported synopses. ϵ is the approximation error bound. δ is the probability of failing to achieve ϵ accuracy. For synopses that can be maintained over a window, respective parameters for window definition are added.

Figure 3: Structure of the Synopses Library (partial view).

As shown in Figure 3, there is a higher-level class called
Synopsis, with attributes comprising a unique identifier and two strings. The first string holds the details of the request (a JSON snippet) with respect to how the synopsis should be physically implemented, i.e., the index of the key field in an incoming data tuple (for a single-stream synopsis), the respective index of the value field on which the summary is built, whether the synopsis is a federated one, which cluster should synthesize the overall estimation, and so on. The second string holds the information included in the JSON snippet regarding synopsis parameters, such as those cited in Table 1. Furthermore, the Synopsis class includes methods for add, estimate and merge, as described in Section 4. Finally, a set of setters and getters for the synopsis, key and value identifiers is provided. Every specific synopsis algorithm is implemented in a separate class, as shown in Figure 3, that extends Synopsis and overrides the add, estimate and merge methods with the algorithmic details of that particular technique [17].

Why Flink. In principle, our architectural design can be materialized over other Big Data platforms such as Storm, Spark or Kafka Streams. The key reason for choosing Flink as the platform for a proof-of-concept implementation of the proposed architecture is the CoFlatMap operator (transformation). As shown in the description of our architecture, the fact that CoFlatMap allows two FlatMap operators to gain access to shared variables was used both for generating keys and assigning data to partitions processed by certain workers (leftmost CoFlatMap in Figure 2), as well as for querying maintained synopses via the estimate FlatMap in the middle of the figure. Although one can implement the CoFlatMap functionality in other Big Data platforms, the native support provided by Flink alleviates the development effort with respect to memory configuration, state management and fault tolerance.
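The add/estimate/merge contract of the Synopsis class can be sketched as follows. This is a minimal Python rendition for illustration only; the engine itself is written against Flink's Java/Scala API, and every name besides add, estimate and merge is our assumption:

```python
class Synopsis:
    """Minimal sketch of the base class: a unique identifier plus the two
    JSON-derived strings (request details and synopsis parameters)."""

    def __init__(self, uid, request_details, parameters):
        self.uid = uid                          # unique synopsis identifier
        self.request_details = request_details  # JSON snippet: key/value indices, federation, ...
        self.parameters = parameters            # JSON snippet: parameters as in Table 1

    def add(self, tuple_):
        raise NotImplementedError   # update the summary with one incoming tuple

    def estimate(self, query):
        raise NotImplementedError   # answer a query from the summary

    def merge(self, other):
        raise NotImplementedError   # combine two partial summaries


class ToyCounter(Synopsis):
    """Toy subclass standing in for a real synopsis such as CountMin:
    it overrides add/estimate/merge with (exact) per-item counting."""

    def __init__(self, uid, request_details, parameters):
        super().__init__(uid, request_details, parameters)
        self.counts = {}

    def add(self, tuple_):
        self.counts[tuple_] = self.counts.get(tuple_, 0) + 1

    def estimate(self, query):
        return self.counts.get(query, 0)

    def merge(self, other):
        for k, v in other.counts.items():
            self.counts[k] = self.counts.get(k, 0) + v
        return self
```

Each concrete synopsis in the library plays the role of ToyCounter here: it inherits the identifier and JSON-derived configuration, and supplies its own algorithmic add/estimate/merge bodies.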
The Red Path. Notice that the blue-colored path in Figure 2 remains totally detached from the red-colored path. This depicts a design choice we follow to facilitate querying capabilities. That is, since the data updates on several maintained synopses may be ingested at an extremely high rate in Kafka at the beginning of the blue path, typically much higher than the rate at which requests are issued in the red path, if the two paths were crossing, back-pressure on the blue-colored path would also affect the timely answering of requests. Having kept the two paths independent, requests can be answered in a timely manner based on the current status of the maintained synopses. This is also true for continuous queries, since they may be translated into a number of requests.

...And One SDEaaS For All. Our SDEaaS approach allows the concurrent maintenance of thousands of synopses for thousands of streams on demand. It further allows different application workflows to share and reuse existing synopses instead of redefining them. The alternative is to submit a separate job for each (one or more) desired synopsis in a respective workflow that uses it. The latter, simplistic approach has a number of drawbacks. First, the same synopses, even with the exact same parameters, cannot be reused/shared among currently running workflows. This means that data streams need to be duplicated and redundant data summaries are built as well. Second, one may end up submitting a different job for each new demand for a maintained synopsis. Apart from increasing the load of a cluster manager, this poses restrictions on the number of synopses that can be simultaneously maintained. Recall from Section 4.1 that each worker in a Flink cluster is assigned a number of task slots and each task slot can host tasks only of the same job. Therefore, lacking our SDEaaS approach means that the number of concurrently maintained synopses is at most equal to the available task slots.
As a rule-of-thumb [2], a default number of task slots would be the number of available CPU cores. Thus, unless thousands of cores are available, one cannot maintain thousands of synopses for thousands of streams. Even when thousands of CPU cores are available, the number of tasks that can run in the same task slot is a multiple of the number of such slots. This observation is utilized in our SDEaaS architecture. Roughly speaking, in SDEaaS a request for a new synopsis on the fly assigns new tasks for it, while lacking the SDEaaS rationale assigns at least one entire task slot. In SDEaaS, synopsis maintenance involves tasks running instances of the operators in Figure 2, instead of devoting entire task slots to each. Each synopsis by design consumes limited memory and entails simple update (add in Figure 2) operations. Thus, in SDEaaS, we have multiple, light tasks virtually competing for task slot resources, and we better exploit the potential for hyper-threading and pseudo-parallelism for the maintained synopses. For the above reasons, SDEaaS is a much more preferable design choice.
Kafka Topics. In Figure 2, we use four specific Kafka topics which the SDE consumes (DataTopic, RequestTopic, UnionTopic) or produces (OutputTopic, UnionTopic). Our SDE is provided as a service and constantly runs as a single Flink job (per cluster, in federated settings). Synopses are created and respective sources of data are added on demand, but our experience in developing the proposed SDE says that there is no reliable way of adding/removing Kafka topics to/from a Flink job dynamically, at runtime. Therefore, all data tuples, requests and outputs need to be written/read in the respective topics, each of which may include a number of partitions, e.g., per stream or data source. This by no means introduces redundancy in the data/requests processed by the SDE, because every data tuple that arrives in the DataTopic has no reason to exist there unless it updates one or more maintained synopses. Similarly, every request that arrives in the RequestTopic creates/queries specific synopses. The same holds for the OutputTopic and UnionTopic. No output is provided unless a continuous query has been defined for a created synopsis or an ad-hoc request arrives. In both cases, the output is meant to be consumed by respective application workflows. Furthermore, internal to the SDE, nothing is consumed or produced in the UnionTopic unless one or more federated synopses are maintained.
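For illustration, a synopsis-creation request would reach the RequestTopic as a JSON message along these lines. This is a hypothetical sketch: the field names and the kafka-python client are our assumptions, not the engine's actual schema:

```python
import json

# Hypothetical request payload for creating a CountMin synopsis; the
# actual JSON schema consumed from the SDE's RequestTopic may differ.
request = {
    "requestType": "CREATE",      # assumed: CREATE / QUERY / DELETE
    "synopsisType": "CountMin",
    "keyIndex": 0,                # index of the key field in a data tuple
    "valueIndex": 2,              # index of the value field summarized
    "federated": False,
    "parameters": {"epsilon": 0.01, "delta": 0.01},  # as in Table 1
}
payload = json.dumps(request).encode("utf-8")

# Publishing it would require a running broker; shown for context only:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("RequestTopic", payload)
```

Answers to such a request would then appear, serialized the same way, on the OutputTopic.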
Windows & Out-of-order Arrival Handling. In Flink, Spark and other Big Data platforms, should a window operator need to be applied on a stream, one would use a programming syntax similar to {streamName||operatorName}.chosenWindowOperator. If one does that in an SDEaaS architecture, the window would be applied to the entire operator, i.e., CoFlatMap, FlatMap and so on in Figure 2. But, in the general case, each maintained synopsis incorporates the definition of its own window, which may differ across the currently maintained synopses, instead of the same window operator being applied to all synopses. Therefore, an SDEaaS design does not allow using the native windowing support provided by the Big Data platform, because the various windows are not known in advance. One should develop custom code and exploit low-level stream processing concepts provided by the corresponding platform (such as the ProcessFunction in Flink [2]) to implement the desired window functionality. The same holds for handling out-of-order tuple arrivals and the functionality provided by .allowedLateness() in Flink or similar operators in other platforms.
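A minimal sketch of such custom, per-synopsis windowing follows (Python, purely illustrative; in the actual engine this logic would live inside a Flink ProcessFunction, and the class and parameter names below are our assumptions):

```python
from collections import deque


class SlidingTimeWindow:
    """Keeps (timestamp, value) pairs within `length` time units of the
    greatest timestamp seen so far, and tolerates out-of-order arrivals
    up to `lateness` time units (a poor man's allowedLateness). Each
    maintained synopsis would own one such window with its own length."""

    def __init__(self, length, lateness=0):
        self.length = length
        self.lateness = lateness
        self.buf = deque()            # (timestamp, value) pairs
        self.max_ts = float("-inf")   # event-time watermark proxy

    def add(self, ts, value):
        self.max_ts = max(self.max_ts, ts)
        cutoff = self.max_ts - self.length - self.lateness
        if ts < cutoff:
            return False              # beyond allowed lateness: drop tuple
        self.buf.append((ts, value))
        # evict tuples that can no longer re-enter the window
        self.buf = deque(p for p in self.buf if p[0] >= cutoff)
        return True

    def contents(self):
        """Values currently inside the window proper."""
        lo = self.max_ts - self.length
        return [v for t, v in self.buf if t >= lo]
```

Here, an out-of-order tuple within the lateness bound is still admitted and re-enters the window, while anything older is discarded, mimicking the .allowedLateness() behavior the native operators would otherwise provide.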
Dynamic Class Loading. YARN-like cluster managers, upon being run as sessions, start the TaskManager and JobManager processes with the Flink framework classes in the Java classpath. Then, job classes are loaded dynamically when the jobs are submitted. But what we require in a Load Synopsis request provided by our API is different. Due to the SDEaaS nature of the SDE, to materialize Load Synopsis we need to achieve loading classes dynamically after the SDE job has been submitted, while the service is up and running. A cluster manager will not permit loading classes at runtime due to security issues, i.e., class loaders are meant to be immutable. In order to bypass such issues for classes involving synopses that are external to our SDE Library, one needs to store the corresponding jar file in HDFS and create one's own child class loader. That is, the child class loader must have a constructor accepting a class loader, which must be set as its parent. The constructor will be called on JVM startup and the real system class loader will be passed. We leave testing Load Synopsis using alternative ways (e.g., via REST API) for future work.

Figure 4: SDEaaS in Practice – Workflow under Study.

7 SDEaaS & HOW IT CAN BE USED
In this section, we design a specific scenario, we build a workflow that resembles, but extends, the Yahoo! Benchmark [16] and, then, we discuss how our SDE and its SDEaaS characteristics can be utilized so as to serve a variety of purposes. Consider our running example from the financial domain. The workflow of Figure 4 illustrates a scenario that utilizes Level 1 and Level 2 stock data, aiming at discovering cross-correlations among stocks and groups of correlated stocks. More precisely, Level 1 data involve stock trades of the form <Date, Time, Price, Volume> for each data tick of an asset (stock). Level 2 data show the activity that takes place before a trade is made. Such activity includes information about offers of shares and corresponding prices, as well as respective bids and prices per stock. Thus, Level 2 data are shaped like series of <Ask price, Ask volume, Bid price, Bid volume> tuples until a trade is made. These pairs are timestamped by the time the stock trade happens. The higher the number of such pairs for a stock, the higher the popularity of the stock. Note that, in Figure 4, we use generic operator namings. The workflow may be specified in any Big Data platform, other than Flink, and still use (in ways that we describe here) the benefits of SDEaaS acting as producer (issuing requests) and consumer to the Kafka topics of Figure 2, abiding by the respective JSON schemata.

In Figure 4, both Level 1 and Level 2 data arrive at a Source. The Split operator separates Level 1 from Level 2 data. It directs Level 2 data to the bottom branch of the workflow. There, the bids are Filtered (i.e., for monitoring only a subset of stocks or keeping only bids above a price/volume threshold). Then, the bids are Counted and only this counter is kept per stock. When a trade for a stock is realized, the corresponding Level 1 tuple is directed by Split to the upper part of the workflow. A Project operator keeps only the timestamp of the trade for each stock. The Join operator afterwards joins the stock trade (Level 1) tuple with the count of bids the stock received until the trade. The corresponding result is inserted in a time Window of recent such counts, forming a time series. The pairwise similarities of the time series or coresets [10] of stocks are computed via an AggregativeOperation. The results, either in the form of pairs of stocks surpassing a similarity threshold (ApplyThreshold operator in Figure 4) or clusters of stocks (ExtractClusters operator in Figure 4), are directed to a Sink to support relevant decision making procedures.

...as a Cost Estimator for Enhanced Horizontal Scalability. SDEaaS can act as a cost estimator that constantly collects statistics for streams (in this scenario, stocks) that are of interest, and these statistics can be used for optimizing the execution of any currently running or new workflow. In our examined scenario, having designed the workflow in Figure 4, we wish to determine an appropriate number of workers that will be assigned for its execution, prescribing the parallelization degree, as well as to balance the processing load among the dedicated workers. For that purpose, a HyperLogLog [21] and a CountMin [19] sketch (see Table 1) can be used, i.e., our SDE constantly runs as a service and keeps HLL and CountMin sketches.

HyperLogLog (HLL) sketches [21] enable the extraction of approximate distinct counts using limited memory and a simple error approximation formula. Therefore, they are useful for estimating the cardinality of the set of stocks that are being monitored per time unit. In the common implementation of HyperLogLog, each incoming element is hashed to a 64-bit bitmap. The hash function is designed so that the hashed values closely resemble a uniform model of randomness, i.e., bits of hashed values are assumed to be independent and to each have an equal probability of occurring. The first b bits of the bitmap are used for bucketizing an incoming element, and we have an array M of m = 2^b buckets (also called registers). The rest 64 − b bits are used so as to count the number of leading zeros, and in each bucket we store the maximum such number of leading zeros hashed to that particular bucket. To extract a distinct count estimation, one needs to compute the normalized harmonic mean of the values 2^{M[j]} of the buckets. The relative standard error of HLL in the estimation of the distinct count is ≈ 1.04/√m. HLLs are trivial to merge based on an equivalent number of buckets maintained independently at each site/cluster.
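The bucketized leading-zero scheme and the max-based merge can be illustrated with a toy HLL in Python (an illustrative sketch built on a simple stdlib hash, not the engine's implementation; the bias-correction constant below is the standard one for 64 registers):

```python
import hashlib


class ToyHLL:
    """Toy HyperLogLog: b bucket bits, m = 2**b registers, each storing
    the max leading-zero rank observed; merged by per-register maxima.
    (Small/large-range corrections of the full estimator are omitted.)"""

    def __init__(self, b=6):
        self.b = b
        self.m = 1 << b
        self.reg = [0] * self.m

    def _hash64(self, item):
        h = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(h[:8], "big")      # 64-bit hash value

    def add(self, item):
        x = self._hash64(item)
        j = x >> (64 - self.b)                   # first b bits pick a register
        rest = x & ((1 << (64 - self.b)) - 1)    # remaining 64-b bits
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.reg[j] = max(self.reg[j], rank)

    def estimate(self):
        # normalized harmonic mean of 2**reg[j]; alpha ~ 0.709 for m = 64
        alpha = 0.709
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)

    def merge(self, other):
        # federated merge: per-register maximum, as in the text above
        self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
        return self
```

Because the merge is a per-register maximum, each site can summarize its own streams independently and a responsible site can synthesize the global distinct count without exchanging raw tuples.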
One should simply derive the maximum among the corresponding buckets of the sites.

A CountMin Sketch [19] is a two-dimensional array of w × d dimensionality, used to estimate frequencies of elements of a stream using a limited amount of memory. For given accuracy ϵ and error probability δ, w = ⌈e/ϵ⌉ (e is Euler's number) and d = ⌈ln(1/δ)⌉. d random, pairwise independent hash functions are chosen for hashing each tuple (concerning a particular stock) to a column in the sketch. When a tuple streams in, it goes through the d hash functions so that one counter in each row is incremented. The estimated frequency for any item is the minimum of the values of its associated counters. This provides an estimation within ϵN, where N is the sum of all frequencies so far (in the financial dataset), with probability at least 1 − δ. CountMin sketches are easily mergeable by adding up the corresponding arrays.

An intrinsic optimizer can use SDEaaS as the cost estimator and derive the cardinality of the set of stocks that need to be monitored per time unit by querying the HLL sketch. Moreover, the CountMin sketch can be queried for estimating the frequency of each stock. Based on the HLL estimation, the optimizer knows how many pieces of work need to be assigned to the workers. And based on the frequency of each stock, the size of each piece of work is also known. Therefore, the optimizer can configure the number of workers and balance the load among them, for instance, by using a Worst Fit Decreasing bin-packing approach [24]. Horizontal scalability is enhanced compared to what is provided by the Big Data platform alone. This is due to having a priori (provided by the SDEaaS nature of the engine) adequate statistics to ensure that no worker is overloaded, causing reduction in the overall throughput during the execution of the workflow.

...for Locality-aware Hashing & Vertical Scalability.
Consider that the AggregativeOperation in Figure 4 involves computing pairwise similarities of stock bid count time series based on Pearson's correlation coefficient. As discussed in Section 1, tracking the full correlation matrix results in a quadratic explosion in space and time, which is simply infeasible for a very large number of monitored stocks. Let us now see how the DFT synopsis (Table 1) can be used for performing locality-aware hashing of streams to buckets, assigning buckets including time series of stocks to workers, and pruning the number of pairwise comparisons for time series that are not hashed nearby. For that purpose, the SDE should be queried in-between the Window and AggregativeOperation of Figure 4, so as to get the bucketID per stock, i.e., the id of the worker where the AggregativeOperation (pairwise similarity estimation) will be performed independently.

Our Discrete Fourier Transform (DFT)-based correlation estimation implementation is based on StatStream [34]. An important observation for assigning time series to buckets is that there is a direct relation between Pearson's correlation coefficient (denoted Corr below) among time series x, y and the Euclidean distance of their corresponding normalized versions (we use primes to distinguish DFT coefficients of normalized time series from the ones of the unnormalized version). In particular, Corr(x, y) = 1 − d²(X′, Y′)/2, where d(.) is the Euclidean distance.

The DFT transforms a sequence of n (potentially complex) numbers x₀, …, x_{n−1} into another sequence of complex numbers X₀, …, X_{n−1}, defined by the DFT coefficients, calculated as X_F = (1/√n) Σ_{k=0}^{n−1} x_k e^{−2πikF/n}, for F = 0, …, n−1, where i = √−1. In practice, we restrict F in the above formula to a few coefficients. There are a couple of additional properties of the DFT which are taken into consideration for parallelizing the processing load of pairwise comparisons among time series:

(1) The Euclidean distance of the original time series and their DFTs is preserved. We use this property to estimate the Euclidean distance of the original time series using their DFTs.

(2) It holds that Corr(x, y) ≥ 1 − ϵ²/2 ⇒ d(X′, Y′) ≤ ϵ. This says that it is meaningful to examine only pairs of time series for which d(X′, Y′) ≤ ϵ. We use this property to bucketize (hash) time series based on the values of their first coefficient(s) and then assign the load of pairwise comparisons within each bucket to workers.

The DFT coefficients can be updated incrementally upon operating over sliding windows [34]. Let us now explain how the time series that are approximated by the DFT coefficients are bucketized, so that possibly similar time series are hashed to the same or neighboring buckets, while the rest are hashed to distant buckets and, therefore, are never compared for similarity. Time series that are hashed to more than one bucket are replicated an equal amount of times.

Now, assume a user-defined threshold T. According to our above discussion, in order for the correlation to be greater than T, d(X′, Y′) needs to be lower than ϵ, with T = 1 − ϵ²/2. By using the DFT on normalized series, the original series are also mapped into a bounded feature space. The norm (the size of the vector composed of the real and the imaginary part of the complex number) of each such coefficient is bounded by √2/2, i.e., each real and imaginary part lies in [−√2/2, √2/2]. Therefore, the DFT feature space is a cube of diameter √2. Based on this, we use a number of DFT coefficients to define a grid structure, composed of buckets for hashing groups of time series to each of them. Each bucket in the grid is of diameter ϵ and there are in total ⌈√2/ϵ⌉^{2·(used coefficients)} buckets. For instance, in [34] 16-40 DFT coefficients are used to approximate stock exchange time series.

Each time series is hashed to a specific bucket inside the grid. Suppose X′ is hashed to a bucket. To detect the time series whose correlation with X′ is above T, only time series hashed to the same or adjacent buckets are possible candidates. Those time series are a super-set of the true set of highly-correlated ones. Since the bucket diameter is ϵ, time series mapped to non-adjacent buckets possess a Euclidean distance greater than ϵ; hence, their respective correlation is guaranteed to be lower than T. Moreover, due to that property, no similarity checks are pruned whose score would pass the threshold.

Again, note that here the principal role of the SDEaaS is to produce the corresponding DFT coefficients and hash time series to buckets. That is why it should be queried between the Window and the
AggregativeOperation. Therefore, the output of the corresponding synopsis in Table 1 includes the resulting coefficients and the bucket identifier. The actual similarity tests (in each bucket) may be performed by the downstream operator (AggregativeOperation) using the original time series.

...Synopsis-based Optimization for Enhanced Horizontal Scalability. When an application is willing to bargain accuracy for a considerable processing speed-up or reduced memory consumption, the SDEaaS can act as the main tool of an advanced optimizer, which would receive the application's accuracy budget and rewrite the workflow to equivalent but approximate forms, so as to achieve the aforementioned performance goals.

Consider the workflow of Figure 4. Since CountMin sketches are not preferable for correlation estimation [19], in our discussion we are going to engage AMS sketches [12]. The key idea in AMS sketches is to represent a streaming (frequency) vector v using a much smaller sketch vector sk(v) that is updated with the streaming tuples and provides probabilistic guarantees for the quality of the data approximation. The AMS sketch defines the i-th sketch entry for the vector v, sk(v)[i], as the random variable Σ_k v[k]·ξ_i[k], where {ξ_i} is a family of four-wise independent binary random variables uniformly distributed in {−1, +1} (with mutually-independent families across different entries of the sketch). Using appropriate pseudo-random hash functions, each such family can be efficiently constructed on-line in logarithmic space. Note that, by construction, each entry of sk(v) is essentially a randomized linear projection (i.e., an inner product) of the v vector (using the corresponding ξ family) that can be easily maintained (using a simple counter) over the input update stream. Every time a new stream element arrives, v[k]·ξ_i[k] is added to the aforementioned sum, and similarly for element deletion. Each sketch vector can be viewed as a two-dimensional w × d array, where w = O(1/ϵ²) and d = O(log(1/δ)), with ϵ, 1 − δ being the desired bounds on error and probabilistic confidence, correspondingly.
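The construction just described can be rendered as a toy implementation (Python, purely illustrative: real AMS families are four-wise independent and hash-based, which we emulate here with a seeded PRNG, and all names are our assumptions):

```python
import random


class ToyAMS:
    """Toy AMS sketch: a w x d array where entry [i][j] maintains
    sum_k v[k] * xi_ij(k) with xi_ij(k) in {-1, +1}. Updates touch every
    entry (real implementations derive signs from cheap hash functions)."""

    def __init__(self, w=256, d=5, seed=0):
        self.w, self.d, self.seed = w, d, seed
        self.sk = [[0.0] * d for _ in range(w)]

    def _sign(self, i, j, k):
        # deterministic pseudo-random +/-1 sign for element k at entry (i, j)
        rnd = random.Random(hash((self.seed, i, j, k)))
        return 1 if rnd.random() < 0.5 else -1

    def add(self, k, weight=1.0):
        # stream update: add weight * xi(k) to each entry (negative
        # weights model deletions)
        for i in range(self.w):
            for j in range(self.d):
                self.sk[i][j] += weight * self._sign(i, j, k)

    def inner_product(self, other):
        # median over the d columns of the average over the w rows of
        # entrywise products (the estimator given in the text)
        vals = sorted(
            sum(self.sk[i][j] * other.sk[i][j] for i in range(self.w)) / self.w
            for j in range(self.d)
        )
        return vals[len(vals) // 2]
```

Taking the inner product of a sketch with itself estimates the squared L2 norm of the summarized frequency vector, which is what makes AMS sketches usable for the similarity judgements discussed next.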
The inner product in the sketch-vector space (and the L2 norm, in which case we replace sk(v₂) with sk(v₁) in the formula below, and vice versa) is defined as: sk(v₁) · sk(v₂) = median_{j=1..d} { (1/w) Σ_{i=1}^{w} sk(v₁)[i, j] · sk(v₂)[i, j] }.

Some workflow execution plans that can be produced using our SDEaaS functionality and an accuracy budget are:

Plan 1: The Count operator in Figure 4 can be rewritten to a
SDE.AMS (sketches) operator provided by our SDEaaS, and then these sketches are used to judge pairwise similarities in the AggregativeOperation.

Plan 2: The SDE.DFT synopsis can replace the Window and AggregativeOperation operators to: (i) bucketize time series comparisons, (ii) speed up similarity tests by approximating original time series with few DFT coefficients.

Plan 3: Rewrite the Count operator to SDE.AMS and rewrite the Window and AggregativeOperation to SDE.DFT, in which case the DFT operates on the sketched instead of the original time series.

Based on which plans abide by the accuracy budget and on the time and space complexity guarantees of each synopsis, the optimizer can pick the workflow execution plan that is expected to provide the highest throughput or lowest memory usage. Again, horizontal scalability is enhanced compared to what the Big Data platform alone provides, by using the potential of synopses.

...for AQP & Federated Scalability. In the scope of Approximate Query Processing (AQP), the workflow of Figure 4 can take advantage of federated synopses that are supported by our SDEaaS architecture (Figure 2, Section 4.2), in order to reduce the amount of data that is communicated and thus enable federated scalability. For instance, assume that the Level 1 and Level 2 data of the stocks being analyzed first arrive at sites (computer clusters, each running our SDEaaS) located at the various countries of the corresponding stock markets. Should one wish to pinpoint correlations of stocks globally, a need to communicate the windowed time series of Figure 4 occurs. To ensure federated scalability in geo-dispersed settings composed of many sites, few coefficients of
SDE.DFT or SDE.AMS sketches can be used to replace the Window operator in Figure 4 and reduce the dimensionality of the time series. Hence, the communication cost is harnessed, as compressed time series are exchanged among the sites, and network latencies are prevented.

...for Data Stream Mining. StreamKM++ [10] computes a weighted sample of a data stream, called the CoreSet of the data stream. A data structure termed CoreSetTree is used to speed up the time necessary for sampling non-uniformly during CoreSet maintenance. After the CoreSet is extracted from the data stream, a weighted k-means algorithm is applied on the CoreSet to get the final clusters for the original stream data. Due to space constraints, please refer to [10] for further details. In this case, the AggregativeOperation is to be replaced by SDE.CoreSetTree and the ExtractClusters operator is a weighted k-means that uses the CoreSets.

Figure 5: SDEaaS Scalability Study. (a) Varying the Parallelism. (b) Varying the Ingestion Rate. (c) Varying the Number of Streams. (d) Varying the Number of Sites.
To test the performance of our SDEaaS approach, we utilize a Kafka cluster with 3 Dell PowerEdge R320 Intel Xeon E5-2430 v2 2.50GHz machines with 32GB RAM each, and one Dell PowerEdge R310 Quad Core Xeon X3440 2.53GHz machine with 16GB RAM. Our Flink cluster has 10 Dell PowerEdge R300 Quad Core Xeon X3323 2.5GHz machines with 8GB RAM each. We use a real dataset composed of ∼ … .

In the experiments of this first set, we test the performance of our SDEaaS approach alone. That is, we purely measure its performance on maintaining various types of synopses operators, without placing these operators provided by the SDE as parts of a workflow. In particular, we measure the throughput, expressed as the number of tuples being processed per time unit (second), and the communication cost (Gbytes) among workers, while varying a number of parameters involving horizontal ((i), (ii)), vertical (iii) and federated (iv) scalability, respectively: (i) the parallelization degree [2-4-6-8-10], (ii) the update ingestion rate [1-2-5-10] times the Kafka ingestion rate (i.e., each tuple read from Kafka is cloned [1-2-5-10] times in memory to further increase the tuples to process), (iii) the number of summarized stocks (streams) [50-500-5000] and (iv) the Gbytes communicated among workers for maintaining each examined synopsis as a federated one. Note that the latter also represents the communication cost that would incur among an equivalent number of sites (computer clusters), instead of workers, each of which maintains its own synopses. In each experiment of this set, we build and maintain Discrete Fourier Transform (DFT – 8 coefficients, 0.9 threshold), HyperLogLog (HLL – 64 bits, m = …), CountMin (ϵ = …, δ = …) and AMS (ϵ = …, δ = 0.01) synopses, each of which, as discussed in Section 7, is destined to support different types of analytics related to correlation, distinct count and frequency estimation, respectively (Table 1). Since the CM and the AMS sketches exhibited very similar performance, we only include CM sketches in the graphs to improve readability. All the above parameters were set after discussions with experts from the data provider and, on the same ground, we use a time window of 5 minutes.

Figure 5(a) shows that increasing the number of Flink workers causes a proportional increase in throughput. This comes as no surprise, since for a steady ingestion rate and a constant number of monitored streams, increasing the parallelization degree causes fewer streams to be processed per worker, which in turn results in reduced processing load for each of them. Figure 5(b), on the other hand, shows that varying the ingestion rate from 1 to 10 causes throughput to increase almost linearly as well. This is a key sign of horizontal scalability, since the figure essentially says that the data rates the SDEaaS can serve, quantified in terms of throughput, are equivalent to the increasing rates at which data arrive to it. Figure 5(c) shows something similar, as the throughput increases upon increasing the number of processed streams from 50 to 5000. This validates our claim regarding the vertical scalability aspects the SDEaaS can bring to the workflows in which it participates. We further comment on such aspects in the comparative analysis in Section 8.2.

Finally, Figure 5(d) illustrates the communication performance of SDEaaS upon maintaining federated synopses and communicating the results to a responsible site so as to derive the final estimations (see yellow arrows in Figure 2 and Section 4.2). For this experiment, we divide the streams among workers and each worker represents a site which analyzes its own stocks by computing CM, HLL and DFT synopses.
A random site is set responsible for merging partial, local summaries and for providing the overall estimation, while we measure the total Gbytes that are communicated among sites/workers as more sites, along with their streams, are taken into consideration. Note that the sites do not communicate all the time, but upon an Ad-hoc Query request every 5 minutes.

Here, the total communication cost for deriving estimations from synopses is not a number that says much on its own. It is expected that the communication cost will rise as more sites are added to the network. The important factor to judge federated scalability is the communication cost when we use the synopses ("CM+HLL+DFT" line in Figure 5(d)) compared to when we do not. Therefore, in Figure 5(d), we also plot a line (labeled "NoCM+NoHLL+NoDFT") illustrating the communication cost that takes place upon answering the same (cardinality, count, time series) queries without synopses. As Figure 5(d) illustrates (the vertical axis is in log scale), the communication gains steadily remain above an order of magnitude.
Figure 6: Comparative Analysis in Executing the Workflowof Figure 4 using
SDE.DFT . We use the DFT synopsis to replace
Window , AggregativeOperation as discussed in Section 7, since the most computationally inten-sive (and thus candidate to become the bottleneck) operator inthe workflow of Figure 4 is the
AggregativeOperation whichperforms pairwise correlation estimations of time series. Indica-tively, when 5K stocks are monitored, the pairwise similarity com-parisons that need to be performed by naive approaches are 12.5M.In Figure 6 we measure the performance of our SDEaaS approachemployed in this work against three alternative approaches. Moreprecisely, the compared approaches are: • Naive : This is the baseline approach which involves sequentialprocessing of incoming tuples without parallelism or any synop-sis. • SDEaaS(DFT+Parallelism) : This is the approach employed inthis work which combines the virtues of parallel processing (us-ing 4 workers in Figure 6) and stream summarization (DFT syn-opsis) towards delivering interactive analytics at extreme scale. • Parallelism(NoDFT) : This approach performs parallel process-ing (4 workers), but does not utilize any synopses to bucketizetime series or reduce their dimensionality. Its performance corre-sponds to competitors such as SnappyData [32] which providean SDE, but their SDE is restricted to simple aggregates, thusneglecting synopses ensuring vertical scalability. Moreover, forthe same reason, it also represents the performance of synopsisutilities provided by Spark. • DFT(NoParallelism) : The DFT(NoParallelism) approach utilizesDFT synopses to bucketize time series and for dimensionalityreduction, but no parallelism is used for executing the work-flow of Figure 4. Pairwise similarity checks are restricted to ad-juscent buckets and thus comparisons can be pruned, but thecomputation of similarities is not performed in parallel for eachbucket. 
This approach corresponds to competitors such as DataSketch [9] or Stream-lib [8], which provide a synopses library but do not include parallel implementations of the respective algorithms and do not follow an SDEaaS paradigm.

Each line in the plot of Figure 6 measures the ratio of the throughput of each examined approach over that of the Naive approach, varying the number of monitored stock streams. Let us first examine each line individually. It is clear that when we monitor a few tens of stocks (50 in the figure), the use of DFT in DFT(NoParallelism) only marginally improves the throughput of the Naive approach (1.5 times higher throughput). On the other hand, the Parallelism(NoDFT) approach improves on the Naive by a larger factor, while SDEaaS(DFT+Parallelism),
Figure 7: Comparative Analysis in Executing the Workflow of Figure 4 using SDE.CoreSetTree (throughput ratio over the Naive approach versus number of streams; 4 workers, ×1 ingestion rate).

taking advantage of both the synopsis and parallelism, improves the Naive by almost 4 times. Note that when 50 streams are monitored, the number of pairwise similarity checks performed in the workflow of Figure 4 by the Naive approach is 2.5K/2.

This is important because, according to Figure 6, when we switch to monitoring 500 streams, i.e., 250K/2 similarity checks are performed by Naive, the fact that the Parallelism(NoDFT) approach lacks the ability of the DFT to bucketize time series and prune unnecessary similarity checks makes its throughput approach that of the Naive approach. This is due to the AggregativeOperation starting to become a computational bottleneck for Parallelism(NoDFT) in the workflow of Figure 4. On the contrary, the DFT(NoParallelism) line remains steady when switching from 50 to 500 streams. The DFT(NoParallelism) approach starts to perform better than Parallelism(NoDFT) at 500 monitored streams, showing that the importance of comparison pruning and, thus, of vertical scalability, grows beyond the importance of parallelism as more streams are monitored. The line corresponding to our SDEaaS(DFT+Parallelism) approach exhibits steady behavior upon switching from 50 to 500 streams, improving on the Naive approach by 4 times, on the DFT(NoParallelism) approach by 3 times and on the Parallelism(NoDFT) approach by 3.5 times.

The most important findings come upon switching to monitoring 5000 stocks (25M/2 similarity checks using Naive or Parallelism(NoDFT)). Figure 6 shows that, because it lacks the vertical scalability provided by the DFT, the Parallelism(NoDFT) approach becomes equivalent to the Naive one. The DFT(NoParallelism) approach improves the throughput of the Naive and of Parallelism(NoDFT) by 7 times. Our SDEaaS(DFT+Parallelism) exhibits 11.5 times better performance compared to Naive and Parallelism(NoDFT), and almost doubles the performance of DFT(NoParallelism).
This validates the potential of SDEaaS(DFT+Parallelism) to support interactive analytics upon judging similarities of millions of pairs of stocks. In addition, studying the difference between DFT(NoParallelism) and SDEaaS(DFT+Parallelism), we can quantify which part of the improvement over Naive and Parallelism(NoDFT) is due to comparison pruning based on time series bucketization and which part is yielded by parallelism. That is, the use of DFT for bucketization and dimensionality reduction increases throughput by 7 times (equivalent to the performance of DFT(NoParallelism)), while the additional improvement entailed by SDEaaS(DFT+Parallelism) is roughly equivalent to the number of workers (4 workers in Figure 6). This indicates the success of SDEaaS in integrating the virtues of data synopses and parallel processing.
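The pruning mechanism that gives the DFT-based approaches their edge can be illustrated in a few lines. In the StatStream style [34], each series is reduced to its first few normalized DFT coefficients; by Parseval's theorem, the Euclidean distance over the retained coefficients lower-bounds the distance over the raw series, so series can be hashed into a grid over a leading coefficient and only same-cell or adjacent-cell pairs ever need a full comparison. The snippet below is a simplified, single-machine Python illustration of this idea under those assumptions (function names such as `candidate_pairs` are ours, not the SDE's API).

```python
import cmath
import math

def dft_coeffs(x, k):
    """First k DFT coefficients, scaled by 1/sqrt(n) so that, by Parseval,
    the distance over all n coefficients equals the raw Euclidean distance;
    keeping only k << n of them therefore lower-bounds the true distance."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            / math.sqrt(n) for f in range(k)]

def candidate_pairs(series, k=4, eps=1.0):
    """Hash each series into a 2-D grid cell via its first non-trivial DFT
    coefficient; any pair within raw distance eps must land in the same or
    adjacent cells, so every other pairwise comparison is safely pruned."""
    coeffs = [dft_coeffs(s, k) for s in series]
    grid = {}
    for i, c in enumerate(coeffs):
        cell = (math.floor(c[1].real / eps), math.floor(c[1].imag / eps))
        grid.setdefault(cell, []).append(i)
    cand = set()
    for (cx, cy), ids in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    cand.update((min(i, j), max(i, j)) for i in ids if i != j)
    return cand
```

Since the comparisons within each populated cell are independent, in the actual engine they can be parallelized across workers, which is precisely the combination SDEaaS(DFT+Parallelism) exploits.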
We then perform a similar experiment for the stream mining version of the workflow in Figure 4, as described in Section 7. In particular, in this experiment the Naive approach corresponds to StreamKM++ clustering without parallelism and with coreset sizes equal to the number of original data points (time series). The Parallelism(NoCoreSetTree) approach involves performing StreamKM++ with coreset sizes equal to the number of original data points, but exploiting parallelism. The CoreSetTree(NoParallelism) approach exploits the CoreSetTree synopsis but uses no parallelism, while SDEaaS(CoreSetTree+Parallelism) combines the two. For CoreSetTree(NoParallelism) and SDEaaS(CoreSetTree+Parallelism), we use bucket sizes of 10-100-400 and k values set to 4−40, for 50-500-5000 streams, correspondingly. The conclusions that can be drawn from Figure 7 are very similar to those we discussed for Figure 6. However, the respective ratios of throughput over the Naive approach are lower (2-3 times higher throughput than the second best candidate in Figure 7). This is by design of the mining algorithm: the clustering procedure includes a reduction step which is performed by a single worker. This is in contrast with the
ApplyThreshold operation in Figure 6, which can be performed by different processing units independently.
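For context, the CoreSetTree synopsis follows StreamKM++'s merge-and-reduce scheme: the stream is chopped into buckets of m weighted points, and whenever two equally sized buckets fill up they are merged and reduced back to m representatives; the final reduction over all buckets necessarily runs on a single worker, which is why the ratios in Figure 7 are capped. Below is a simplified Python sketch of one such reduce step, using kmeans++-style weighted seeding. It is a deliberate simplification of the actual StreamKM++ coreset construction, and the function names are ours.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def reduce_to_coreset(points, weights, m, rng=None):
    """One merge-and-reduce step: pick m seeds with probability proportional
    to weight times squared distance from the already-chosen seeds (kmeans++
    style), then move every point's weight onto its nearest seed. The total
    weight of the bucket is preserved exactly."""
    rng = rng or random.Random(0)
    idx = [rng.choices(range(len(points)), weights=weights)[0]]
    d2 = [dist2(p, points[idx[0]]) for p in points]
    while len(idx) < m:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:               # fewer distinct points than m seeds
            idx.append(rng.randrange(len(points)))
            continue
        j = rng.choices(range(len(points)), weights=probs)[0]
        idx.append(j)
        d2 = [min(a, dist2(p, points[j])) for a, p in zip(d2, points)]
    seeds = [points[j] for j in idx]
    new_w = [0.0] * m
    for p, w in zip(points, weights):
        new_w[min(range(m), key=lambda s: dist2(p, seeds[s]))] += w
    return seeds, new_w
```

Merging two full buckets then amounts to concatenating their points and weights and calling `reduce_to_coreset` on the union; because this reduction is inherently sequential at the root of the tree, parallelism helps less here than in the DFT workflow.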
In Section 6 we argued that employing a non-SDEaaS approach, as works such as [7] do, restricts the maximum number of concurrently maintained synopses to the number of available task slots. That is, if the SDE is not provided as a service using our novel architecture and we want to maintain a new synopsis when a demand arises (without ceasing the currently maintained ones, because these may already serve workflows such as the one in Figure 4), we have to submit a new job. A job occupies at least one task slot. On the contrary, in our SDEaaS approach, when a request for a new synopsis arrives on-the-fly, we simply devote more tasks (which can exploit hyper-threading, pseudo-parallelism, etc.) instead of entire task slots. Because of that, our SDEaaS design is a much more preferable choice, since it can simultaneously maintain thousands of synopses for thousands of streams.
Figure 8: Comparison of SDEaaS vs non-SDEaaS (total throughput in tuples/sec versus number of concurrently maintained synopses; ×10 ingestion rate). ✘ signs denote that non-SDEaaS cannot maintain more than 40 synopses simultaneously, since the available task slots are depleted.

To show the superiority of our approach in practice, we design an experiment where we start by maintaining 2 CM sketches for frequency estimations on the (volume, price) pairs of each stock. Note that this differs from what we did in Figure 5, where we kept a CM sketch for estimating the count of bids per stock over the whole dataset. Then, we express demands for maintaining one more CM sketch at a time, up to 5000 sketches/stocks. We do that without stopping the already running synopses each time. We measure the sum of throughputs of all running jobs for the non-SDEaaS approach and the throughput of our SDE, and plot the results in Figure 8.

First, it can be observed that we cannot maintain more than 40 synopses simultaneously using the non-SDEaaS approach, since we deplete the available task slots. This is denoted with ✘ signs in the plot. Second, even when up to 40 synopses are concurrently maintained, our SDEaaS approach always performs better than the non-SDEaaS alternative. This is because slot sharing in SDEaaS means that more than one task is scheduled into the same slot; in other words, CM sketches end up sharing resources. The main benefit of this is better resource utilization. In the non-SDEaaS approach, if there is skew in the update rate of a number of streams (to which one task slot per synopsis per stream is allotted), we might easily end up with some slots doing very little work at certain intervals, while others are quite busy. This is avoided in SDEaaS due to slot sharing. Therefore, better resource utilization is an additional advantage of our SDEaaS approach.
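For reference, each of the maintained CM sketches [19] is just a small d × w counter array: an update touches one counter per row, and a point query returns the minimum over the rows, so estimates can only over-count, and only due to hash collisions. A minimal Python rendition keyed by (volume, price) pairs as in this experiment follows; the class and hashing scheme are ours for illustration, not the engine's code.

```python
import hashlib

class CountMin:
    """Count-Min sketch: d rows of w counters. Memory stays fixed at d*w
    counters no matter how many distinct keys are inserted."""
    def __init__(self, w=2048, d=5):
        self.w, self.d = w, d
        self.cells = [[0] * w for _ in range(d)]

    def _bucket(self, row, key):
        # One independent-looking hash per row, derived from a keyed digest.
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.w

    def update(self, key, count=1):
        for r in range(self.d):
            self.cells[r][self._bucket(r, key)] += count

    def estimate(self, key):
        # Minimum over rows: never below the true count, rarely far above it.
        return min(self.cells[r][self._bucket(r, key)] for r in range(self.d))
```

Because each sketch is this small and its updates are cheap per-tuple increments, scheduling many of them as tasks inside shared slots, as SDEaaS does, wastes far less capacity than dedicating a slot to each.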
In this work we introduced a Synopses Data Engine (SDE) for enabling interactive analytics over voluminous, high-speed data streams. Our SDE is implemented following an SDE-as-a-Service (SDEaaS) paradigm and is materialized via a novel architecture. It is easily extensible, customizable with new synopses and capable of providing various types of scalability. Moreover, we exhibited ways in which SDEaaS can serve workflows for different purposes, and we commented on implementation insights and lessons learned throughout this endeavor. Our future work focuses on (a) enriching the SDE Library with more synopsis techniques [17], (b) integrating it with machine learning components such as [7], (c) implementing the proposed SDEaaS architecture directly on the data ingestion layer via Kafka Streams, which lacks facilities like CoFlatMap, and (d) doing the same for Apache Beam [1], to make the service directly runnable on a variety of Big Data platforms.
REFERENCES
ACM Journal of Experimental Algorithmics 17, 1 (2012).
[11] P. K. Agarwal, G. Cormode, Z. Huang, J. Phillips, Z. Wei, and K. Yi. 2012. Mergeable Summaries. In PODS.
[12] N. Alon, Y. Matias, and M. Szegedy. 1996. The Space Complexity of Approximating the Frequency Moments. In STOC.
[13] B. Babcock, M. Datar, and R. Motwani. 2002. Sampling from a moving window over streaming data. In SODA.
[14] B. H. Bloom. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors. Commun. ACM 13, 7 (1970), 422–426.
[15] M. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC.
[16] S. Chintapalli, D. Dagit, B. Evans, et al. 2016. Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming. In IPDPS Workshops.
[17] G. Cormode, M. Garofalakis, P. Haas, and C. Jermaine. 2012. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1–294.
[18] G. Cormode and M. N. Garofalakis. 2008. Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33, 2 (2008), 9:1–9:39.
[19] G. Cormode and S. Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 1 (2005), 58–75.
[20] G. Cormode and K. Yi. 2020 (to be published). Small Summaries for Big Data. Cambridge University Press.
[21] P. Flajolet, E. Fusy, O. Gandouet, et al. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA.
[22] P. Flajolet and G. N. Martin. 1985. Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31, 2 (1985), 182–209.
[23] M. Garofalakis, J. Gehrke, and R. Rastogi. [n.d.]. Data Stream Management: A Brave New World. In Data Stream Management - Processing High-Speed Data Streams. Springer.
[24] N. Giatrakos, E. Alevizos, A. Artikis, A. Deligiannakis, and M. N. Garofalakis. 2020. Complex event recognition in the Big Data era: a survey. VLDB J. 29, 1 (2020), 313–352.
[25] N. Giatrakos, N. Katzouris, A. Deligiannakis, et al. 2019. Interactive Extreme-Scale Analytics Towards Battling Cancer. IEEE Technol. Soc. Mag. 38, 2 (2019), 54–61.
[26] N. Giatrakos, Y. Kotidis, A. Deligiannakis, V. Vassalos, and Y. Theodoridis. 2013. In-network approximate computation of outliers with quality guarantees. Inf. Syst. 38, 8 (2013), 1285–1308.
[27] M. Greenwald and S. Khanna. 2001. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD.
[28] Z. Kaoudi and J. A. Quiané-Ruiz. 2018. Cross-Platform Data Processing: Use Cases and Challenges. In ICDE.
[29] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl. 2018. Benchmarking Distributed Stream Data Processing Systems. In ICDE.
[30] G. S. Manku and R. Motwani. 2002. Approximate Frequency Counts over Data Streams. In VLDB.
[31] A. Milios, K. Bereta, K. Chatzikokolakis, D. Zissis, and S. Matwin. 2019. Automatic Fusion of Satellite Imagery and AIS Data for Vessel Detection. In Fusion.
[32] B. Mozafari. 2019. SnappyData. In Encyclopedia of Big Data Technologies.
[33] E. Zeitler and T. Risch. 2011. Massive Scale-out of Expensive Continuous Queries. PVLDB 4, 11 (2011).
[34] Y. Zhu and D. E. Shasha. 2002. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. In