A Programming Model for Hybrid Workflows: combining Task-based Workflows and Dataflows all-in-one
Cristian Ramon-Cortes, Francesc Lordan, Jorge Ejarque, Rosa M. Badia
Barcelona Supercomputing Center (BSC)

This manuscript has been accepted at Future Generation Computer Systems (FGCS). DOI: 10.1016/j.future.2020.07.007. This manuscript is licensed under CC-BY-NC-ND.
Abstract
In the past years, e-Science applications have evolved from large-scale simulations executed in a single cluster to more complex workflows where these simulations are combined with High-Performance Data Analytics (HPDA). To implement these workflows, developers are currently using different patterns; mainly task-based and dataflow. However, since these patterns are usually managed by separated frameworks, the implementation of these applications requires combining them, considerably increasing the effort for learning, deploying, and integrating applications in the different frameworks.

This paper tries to reduce this effort by proposing a way to extend task-based management systems to support continuous input and output data, enabling the combination of task-based workflows and dataflows (Hybrid Workflows from now on) using a single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements. To illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library and a fully functional prototype extending COMPSs, a mature, general-purpose, task-based, parallel programming model. The library can be easily integrated with existing task-based frameworks to provide support for dataflows. Also, it provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python, enabling complex workflows to handle any data type without dealing directly with the streaming back-end.
During the evaluation, we introduce four use cases to illustrate the new capabilities of Hybrid Workflows; measuring the performance benefits when processing data continuously as it is generated, when removing synchronisation points, when processing external real-time data, and when combining task-based workflows and dataflows at different levels. Users identifying these patterns in their workflows may use the presented use cases (and their performance improvements) as a reference to update their code and benefit from the capabilities of Hybrid Workflows. Furthermore, we analyse the scalability in terms of the number of writers and readers, and measure the task analysis, task scheduling, and task execution times when using objects or streams.
1. Introduction
For many years, large-scale simulations, High-Performance Data Analytics (HPDA), and simulation workflows have become a must to progress in many scientific areas such as life, health, and earth sciences. In such a context, there is a need to adapt the High-Performance infrastructure and frameworks to support the needs and challenges of workflows combining these technologies [1].

Traditionally, developers have tackled the parallelisation and distributed execution of these applications following two different strategies. On the one hand, task-based workflows orchestrate the execution of several pieces of code (tasks) that process and generate data values. These tasks have no state and, during their execution, they are isolated from other tasks; thus, task-based workflows consist of defining the data dependencies among tasks. On the other hand, dataflows assume that tasks are persistent executions with a state that continuously receive/produce data values (streams). Through dataflows, developers describe how the tasks communicate with each other.

Regardless of the workflow type, directed graphs are a useful visualisation and management tool. Figure 1 shows the graph representation of a task-based workflow (left) and its equivalent dataflow (right). The task dependency graph consists of a producer task (coloured in pink) and five consumer tasks (coloured in blue) that can run in parallel after the producer completes. The dataflow graph also has a producer task (coloured in pink), but one single stateful consumer task (coloured in blue) which processes all the input data sequentially (unless the developer internally parallelises it). Rather than waiting for the completion of the producer task to process all its outputs, the consumer task can process the data as it is generated.
Figure 1: Representation of the same workflow using task-based and dataflow patterns (left and right, respectively).
The first contribution of this paper is the proposal of a single hybrid programming model capable of executing task-based workflows and dataflows simultaneously. For that purpose, we extend task-based frameworks to support continuous input and output data, and enable the combination of task-based workflows and dataflows (Hybrid Workflows from now on) using the same programming model. Moreover, it allows developers to build complex Data Science pipelines with different approaches depending on the requirements. The evaluation (Section 6) demonstrates that the use of Hybrid Workflows has significant performance benefits when identifying some patterns in task-based workflows; e.g., when processing data continuously as it is generated, when removing synchronisation points, when processing external real-time data, and when combining task-based workflows and dataflows at different levels. Also, notice that using a single programming model frees the developers from the burden of deploying, using, and communicating different frameworks inside the same workflow.

The second contribution presented in this paper is a Distributed Stream Library that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of a stream, enabling complex workflows to handle any kind of data without dealing directly with the streaming back-end. At its current state, the library supports file streams through a custom implementation, and object streams through Kafka [2].

To validate and evaluate our proposal, we have built a prototype that helps us to illustrate the additional capabilities of Hybrid Workflows for complex Data Science applications. This prototype extends COMPSs [3, 4], a mature, general-purpose programming model, and integrates the Distributed Stream Library.

The rest of the paper is organised as follows. Section 2 presents an overview of the related work, and Section 3 introduces the baseline technology on which our solution is built. Next, Section 4 describes the architecture of the Distributed Stream Library and its integration with COMPSs. Section 5 details four use cases to illustrate the new available features, and Section 6 evaluates our proposal, measuring performance improvements at application level and the additional runtime overhead. Finally, Section 7 concludes the paper and gives some guidelines for future work.
2. Related Work
Nowadays, state-of-the-art frameworks typically focus on the execution of either task-based workflows or dataflows. Thus, the next subsections provide a general overview of the most relevant frameworks for both task-based workflows and dataflows. Furthermore, since our prototype combines both approaches into a single programming model and allows developers to build Hybrid Workflows without deploying and managing two different frameworks, the last subsection details other solutions and compares them with our proposal.
Although all the frameworks handle the tasks and data transfers transparently, there are two main approaches to define task-based workflows. On the one hand, many frameworks force developers to explicitly define the application workflow through a recipe file or a graphical interface. FireWorks [5, 6] defines complex workflows using recipe files in Python, JSON, or YAML. It focuses on high-throughput applications, such as computational chemistry and materials science calculations, and provides support for arbitrary computing resources (including queue systems), monitoring through a built-in web interface, failure detection, and dynamic workflow management. Taverna [7, 8] is a suite of tools to design, monitor, and execute scientific workflows. It provides a graphical user interface for the composition of workflows that are written in the Simple Conceptual Unified Flow Language (Scufl) and executed remotely by the Taverna Server on any underlying infrastructure (such as supercomputers, Grids, or cloud environments). Similarly, Kepler [9, 10] also provides a graphical user interface to compose workflows by selecting and connecting analytical components and data sources. Furthermore, workflows can be easily stored, reused, and shared across the community. Internally, Kepler's architecture is actor-oriented to allow different execution models into the same workflow. Also, Galaxy [11, 12] is a web-based platform for data analysis focused on accessibility and reproducibility of workflows across the scientific community. The users define workflows through the web portal and submit their executions to a Galaxy server containing a full repertoire of tools and reference data. In an attempt to increase the interoperability between the different systems and to avoid the duplication of development efforts, Tavaxy [13] integrates Taverna and Galaxy workflows in a single environment; defining an extensible set of re-usable workflow patterns and supporting cloud capabilities. Although Tavaxy allows the composition of workflows using Taverna and Galaxy sub-workflows, the resulting workflow supports neither streams nor any dataflow pattern.

On the other hand, other frameworks implicitly build the task dependency graph from the user code. Some opt for defining a new scripting language to manage the workflow. These solutions force the users to learn a new language but make a clear differentiation between the workflow's management (the script) and the processes or programs to be executed. Nextflow [14, 15] enables scalable and reproducible workflows using software containers. It provides a fluent DSL to implement and deploy workflows but allows the adaptation of pipelines written in the most common scripting languages. Swift [16, 17] is a parallel scripting language developed in Java and designed to express and coordinate parallel invocations of application programs on distributed and parallel computing platforms. Users only define the main application and the input and output parameters of each program, so that Swift can execute the application in any distributed infrastructure by automatically building the data dependencies.

Other frameworks opt for defining some annotations on top of an already existing language. These solutions save the users from learning a new language but merge the workflow annotations and its execution in the same files. Parsl [18] evolves from Swift and provides an intuitive way to build implicit workflows by annotating "apps" in Python codes.
In Parsl, the developers annotate Python functions (apps) and Parsl constructs a dynamic, parallel execution graph derived from the implicit linkage between apps based on shared input/output data objects. Parsl then executes apps when their dependencies are met. Parsl is resource-independent, that is, the same Parsl script can be executed on a laptop, cluster, cloud, or supercomputer. Dask [19] is a library for parallel computing in Python. Dask follows a task-based approach, being able to take into account the data dependencies between the tasks and exploiting the inherent concurrency. Dask has been designed for computation and interactive data science and integration with Jupyter notebooks. It is built on the dataframe data structure that offers interfaces to NumPy, Pandas, and Python iterators. Dask supports implicit, simple task graphs previously defined by the system (Dask Array or Dask Bag) and, for more complex graphs, the programmer can rely on the delayed annotation that supports the asynchronous execution of tasks by building the corresponding task graph. COMPSs [3, 20, 4] is a task-based programming model for the development of workflows/applications to be executed in distributed programming platforms. The task-dependency graph (or workflow) is generated at execution time and depends on the input data and the dynamic execution of the application. Thus, compared with other workflow systems that are based on the static drawing of the workflow, COMPSs offers a tool for building dynamic workflows, with all the flexibility and expressivity of the programming language.
Stream processing has become an increasingly prevalent solution to process data from social media and sensor devices. On the one hand, many frameworks have been created explicitly to face this problem. Apache Flink [21] is a streaming dataflow engine to perform stateful computations over data streams (i.e., event-driven applications, streaming pipelines, or stream analytics). It provides exactly-once processing, high throughput, automated memory management, and advanced streaming capabilities (such as windowing). Flink users build dataflows that start with one or more input streams (sources), perform arbitrary transformations, and end in one or more outputs (sinks). Apache Samza [22] allows building stateful applications for event processing or real-time analytics. Its differential point is to offer built-in support to process and transform data from many sources, including Apache Kafka, AWS Kinesis, Azure EventHubs, ElasticSearch, and HDFS. Samza users define a stream application that processes messages from a set of input streams, transforms them by chaining multiple operators, and emits the results to output streams or stores. Also, Samza supports at-least-once processing, guaranteeing no data loss even in case of failures. Apache Storm [23] is a distributed real-time computation system based on the master-worker architecture and used in real-time analytics, online machine learning, continuous computation, and distributed RPC, among others. Storm users define topologies that consume streams of data (spouts) and process those streams in arbitrarily complex ways (bolts), re-partitioning the streams between each stage of the computation however needed. Although Storm natively provides at-least-once processing, it also supports exactly-once processing via its high-level API called Trident. Twitter Heron [24] is a real-time, fault-tolerant stream processing engine. Heron was built as Storm's successor, meaning that the topology concepts (spouts and bolts) are the same and its API is compatible with Storm. However, Heron provides better resource isolation, new scheduler features (such as on-demand resources), better throughput, and lower latency.
Apache Spark [25] is a general framework for big data processing that was originally designed to overcome the limitations of MapReduce [26]. Among the many built-in modules, Spark Streaming [27] is an extension of the Spark core to evolve from batch processing to continuous processing by emulating streaming via micro-batching. It ingests input data streams from many sources (e.g., Kafka, Flume, Kinesis, ZeroMQ) and divides them into batches that are then processed by the Spark engine; allowing streaming to be combined with batch queries. Internally, the continuous stream of data is represented as a sequence of RDDs in a high-level abstraction called Discretized Stream (DStream).

Notice that Spark is based on high-level operators (operators on RDDs) that are internally represented as a DAG, limiting the patterns of the applications. In contrast, our approach is based on sequential programming, which allows the developer to build any kind of application. Furthermore, micro-batching requires a predefined threshold or frequency before any processing occurs; this can be "real-time" enough for many applications, but may lead to failures when micro-batching is simply not fast enough. In contrast, our solution uses a dedicated streaming engine to handle dataflows; relying on streaming technologies rather than micro-batching and ensuring that the data is processed as soon as it is available.

On the other hand, other solutions combine existing frameworks to support Hybrid Workflows. Asterism [28] is a hybrid framework combining dispel4py and Pegasus at different levels to run data-intensive stream-based applications across platforms on heterogeneous systems. The main idea is to represent the different parts of a complex application as dispel4py workflows which are, then, orchestrated by Pegasus as tasks. While the stream-based execution is managed by dispel4py, the data movement between the different execution platforms and the workflow engine (submit host) is managed by Pegasus. Notice that Asterism can only handle dataflows inside task-based workflows (dispel4py workflows represented as Pegasus tasks), while our proposal is capable of orchestrating nested task-flows, nested dataflows, dataflows inside task-based workflows, and task-based workflows inside dataflows.
3. Background
To enable the construction of Hybrid Workflows using the same programming model, we decided to extend an already existing task-based programming model with support for dataflows. Since we have chosen COMPSs as the base workflow manager for our prototype, this section introduces the essential concepts for understanding its programming model and supporting runtime. The second part of this section briefly introduces Kafka, since the default backend of the Distributed Stream Library uses it to support object streams.
3.1. COMP Superscalar (COMPSs)

COMP Superscalar (COMPSs) is a programming model based on sequential programming and designed to abstract developers away from the parallelisation and distribution details such as thread creation, synchronisation, data distribution, message passing, or fault tolerance. COMPSs is a task-based model; application developers select a set of methods whose invocations are considered tasks that will run asynchronously in distributed nodes.

As shown in Figure 2, Java is the native programming language to develop COMPSs applications; however, COMPSs also provides bindings for Python (PyCOMPSs [29]) and C/C++ [30]. Its programming model is based on annotations that are used to choose class and object methods as tasks. These annotations can be split into two groups:
• Method Annotations: Annotations added to the sequential code methods to define them as tasks and potentially execute them in parallel.
• Parameter Annotations: Annotations added to the parameters of an annotated method to indicate the direction (IN, OUT, INOUT) of the data used by a task.
Figure 2: COMPSs overview.
The runtime [20] system supporting the model follows a master-worker architecture. The master node, on which the main code of the application runs, orchestrates the execution of the applications and its tasks on the underlying infrastructure, described in an XML configuration file. For that purpose, it intercepts calls to methods annotated as tasks and, for each detected call, it analyses the data dependencies with previous tasks according to the defined parameter annotations. As the result of this analysis, the runtime builds a Directed Acyclic Graph (DAG) where nodes represent tasks and edges represent data dependencies between them; thus, the runtime infers the application parallelism. The master node transparently schedules and submits a task execution on a worker node, handling the required data transfers. If a partial failure arises during a task execution, the master node handles it with job re-submission and re-scheduling techniques.

COMPSs guarantees the portability across different computing platforms such as clusters, supercomputing machines, clouds, or container-managed infrastructures without modifying the application code. Also, the runtime allows the usage of third-party plugins by implementing a simple interface to extend its usage to new infrastructures or change the scheduling policy.

Furthermore, the COMPSs framework also provides a live-monitoring tool through a built-in web interface. For further details on executions, users can enable the instrumentation of their application using Extrae [31] and generate post-mortem traces that can be analysed with Paraver [32].
A COMPSs application in Java is composed of three parts:

• Main application: Sequential code defining the workflow of the application. It must contain calls to class or object methods annotated as tasks so that, at execution time, they can be asynchronously executed on the available resources.

• Remote Methods: Code containing the implementation of the tasks. This code can be in the same file as the application's main code or in one or more separate files.

• Annotated Interface: List of annotated methods that can be remotely executed as tasks. It contains one entry per task defining the Method Annotation, the object or class method name, and one Parameter Annotation per method parameter.

Notice that COMPSs Java applications do not require any special API, pragma, or construct since COMPSs instruments the application's code at execution time to detect the tasks defined in the annotated interface. Hence, the COMPSs annotations do not interfere with the applications' code, and all applications can be tested sequentially. A minimal example of the three parts is sketched below.
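The following sketch illustrates the three parts with a hypothetical increment application; the class names and file layout are illustrative, while the @Method and @Parameter annotations follow the interface syntax shown later in Listings 6 and 7.

// Main application: plain sequential Java; each call to Tasks.increment()
// becomes an asynchronous task at execution time
public class Main {
    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        for (int i = 0; i < n; i++) {
            Tasks.increment("counter_" + i + ".txt");
        }
    }
}

// Remote method: the implementation of the task
public class Tasks {
    public static void increment(String fileName) {
        // Read the counter stored in fileName, add one, and write it back
    }
}

// Annotated interface: declares increment() as a task with one INOUT file
public interface MainItf {
    @Method(declaringClass = "Tasks")
    void increment(
        @Parameter(type = Type.FILE, direction = Direction.INOUT)
        String fileName
    );
}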
The Python syntax in COMPSs is supported through a binding: PyCOMPSs. This Python binding is supported by a Binding-commons layer which focuses on enabling the functionalities of the runtime in other languages (currently, Python and C/C++). It has been designed as an API with a set of defined functions. It is written in C and performs the communication with the runtime through the JNI [33].

In contrast with the Java syntax, all PyCOMPSs annotations are done inline. The Method Annotations are in the form of Python decorators. Hence, the users can add the @task decorator on top of a class or object method to indicate that its invocations will become tasks at execution time. Furthermore, the Parameter Annotations are contained inside the Method Annotation.

Listing 1 shows an example of a task annotation. The first line contains the task annotation in the form of a Python decorator, while the rest of the code is a regular Python function. The parameter c is of type INOUT, and parameters a and b are set to the default type IN. The directionality tags are used at execution time to derive the data dependencies between tasks and are applied at an object level, taking into account its references to identify when two tasks access the same object.

@task(c=INOUT)
def multiply(a, b, c):
    c += a * b

Listing 1: PyCOMPSs task annotation example.
A tiny synchronisation API completes the PyCOMPSs syntax. For instance, as shown in Listing 2, compss_wait_on waits until all the tasks modifying result's value are finished and brings the value to the node executing the main program (line 4). Once the value is retrieved, the execution of the main program code is resumed. Given that PyCOMPSs is mostly used in distributed environments, synchronising may imply a data transfer from remote storage or memory space to the node executing the main program.

for block in data:
    presult = word_count(block)
    reduce_count(result, presult)
final_result = compss_wait_on(result)

Listing 2: PyCOMPSs synchronisation API example.
Similarly, the API includes a compss_open(file_name, mode='r') to synchronise files, and a compss_barrier() to explicitly wait for the completion of all the previous tasks.
3.2. Kafka

Figure 3 illustrates the basic concepts in Kafka and how they relate to each other. Records (each blue box in the figure) are key-value pairs containing application-level information registered along with its publication time.

Figure 3: Description of Kafka's basic concepts.
Kafka users define several categories or topics to which records belong. Kafka relies on ZooKeeper [34, 35] to store each topic as a partitioned log with an arbitrary number of partitions and maintains a configurable number of partition replicas across the cluster to provide fault tolerance and record-access parallelism. Each partition contains an immutable, publication-time-ordered sequence of records, each uniquely identified by a sequential id number known as the offset of the record. The example in the figure defines two topics (Topic A and Topic B) with 2 and 3 partitions, respectively.

Finally, Producers and Consumers are third-party application components that interact with Kafka to publish and retrieve data. The former add new records to the topics of their choice, while the latter subscribe to one or more topics for receiving records related to them. Consumers can join in Consumer Groups. Kafka ensures that each record published to a topic is delivered to at least one consumer instance within each subscribing group; thus, multiple processes on remote machines can share the processing of the records of that topic. Although most often delivered exactly once, records might be duplicated when one consumer crashes without a clean shutdown and another consumer within the same group takes over its partitions.

Back to the example in the figure, Producer A publishes one record to Topic A, and Producer B publishes two records, one to Topic A and one to Topic B. Consumer A, with a group of its own, processes all the records in Topic A and Topic B. Since Consumer B and Consumer C belong to the same consumer group, they share the processing of all the records from Topic B.

Besides the Consumer and Producer APIs, Kafka also provides the Stream Processor and Connector APIs. The former, usually used in the intermediate steps of fluent stream processing, allows application components to consume an input stream from one or more topics and produce an output stream to one or more topics. The latter is used for connecting producers and consumers to already existing applications or data systems. For instance, a connector to a database might capture every change to a table.
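As a reference for how these concepts surface in code, the sketch below publishes one record to a topic and consumes it through Kafka's Java client API; the server address, topic name, group id, and (de)serialiser choices are placeholder assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer: add a new record to Topic A
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("topicA", "key", "value"));
        }

        // Consumer: subscribe to Topic A and poll the published records;
        // consumers sharing the same group id share the records of a topic
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "groupA");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("topicA"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}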
4. Architecture
Figure 4 depicts a general overview of the proposed solution. When executing regular task-based workflows, the application written following the programming model interacts with the runtime to spawn the remote execution of tasks and retrieve the desired results. Our proposal includes a representation of a stream (DistroStream interface) that provides applications with homogeneous stream accesses regardless of the stream backend supporting them. Moreover, we extend the programming model and runtime to provide task annotations and scheduling capabilities for streams.
Figure 4: General architecture.
The following subsections discuss the architecture of the proposed solution in a bottom-up approach, starting from the representation of a stream (DistroStream API) and its implementations. Next, we describe the Distributed Stream Library and its internal components. Finally, we detail the integration of this library with the programming model (COMPSs) and the necessary extensions of its runtime system.
The Distributed Stream is a representation of a stream used by applications to publish and receive data values. Its interface provides a common API to guarantee homogeneity on all interactions with streams.

As shown in Listing 3, the DistroStream interface provides a publish method for submitting a single message or a list of messages (lines 5 and 6) and a poll method to retrieve all the currently available unread messages (lines 9 and 10). Notice that the latter has an optional timeout parameter (in milliseconds) to wait until an element becomes available or the specified time expires. Moreover, the streams can be created with an optional alias parameter (line 2) to allow different applications to communicate through them. Also, the interface provides common methods to close the stream, check its status, and retrieve metadata information.

// INSTANTIATION
public DistroStream(String alias) throws RegistrationException;

// PUBLISH METHODS
public abstract void publish(T message) throws BackendException;
public abstract void publish(List<T> messages) throws BackendException;

// POLL METHODS (signatures inferred from the description above)
public abstract List<T> poll() throws BackendException;
public abstract List<T> poll(long timeout) throws BackendException;

Listing 3: DistroStream interface (excerpt).
Figure 5: DistroStream class relationship.
As shown in Figure 5, two different implementations of the DistroStream API provide the specific logic to support object and file streams. Object streams are suitable when sharing data within the same language or framework. On the other hand, file streams allow different frameworks and languages to share data. For instance, the files generated by an MPI simulation in C or Fortran can be received through a stream and processed in a Python or Java application.
ObjectDistroStream (ODS) implements the generic DistroStream interface to support object streams. Each ODS has an associated ODSPublisher and ODSConsumer that interact appropriately with the software handling the message transmission (streaming backend). The ODS instantiates them upon the first invocation of a publish or a poll method, respectively. This behaviour guarantees that the same object stream has different publisher and consumer instances when accessed from different processes, and that the producer and consumer instances are only registered when required, avoiding unneeded registrations on the streaming backend.

At its current state, the available implementation is backed by Kafka, but the design is prepared to support many backends. Notice that the ODS, the ODSPublisher, and the ODSConsumer are just abstractions to hide the interaction with the underlying backend. Hence, any other backend (such as an MQTT broker) can be supported without any modification at the workflow level by implementing the functionalities defined in these abstractions.

Considering the Kafka concepts introduced in Section 3.2, each ODS becomes a Kafka topic named after the stream id. When created, the ODSPublisher instantiates a KafkaProducer whose publish method builds a new ProducerRecord and submits it to the corresponding topic via the KafkaProducer.send method. If the publish invocation sends several messages, the ODSPublisher iteratively performs the publishing process for each message so that Kafka registers them as separate records.

Likewise, a new KafkaConsumer is instantiated along with an ODSConsumer. Then, the KafkaConsumer is registered to a consumer group shared by all the consumers of the same application to avoid replicated messages, and subscribed to the topic named after the id of the stream. Hence, the poll method retrieves a list of ConsumerRecords and deserialises their values. To ensure that records are processed exactly once, consumers also interact with Kafka's AdminClient to delete all the processed records from the database.
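The sketch below illustrates the producer/consumer pattern of this API, assuming the signatures of Listing 3 plus the close() and isClosed() status calls described in the text; the process() helper is hypothetical.

// PRODUCER
void produce(List<T> objs) throws Exception {
    // Create the stream; the type parameter T fixes the type of all elements
    ObjectDistroStream<T> ods = new ObjectDistroStream<>("myStream");
    // Publish a single element
    ods.publish(objs.get(0));
    // Publish a list of elements
    ods.publish(objs);
    // Close the stream once all the data has been published
    // (metadata accessors are also available but omitted here)
    ods.close();
}

// CONSUMER
void consume(ObjectDistroStream<T> ods) throws Exception {
    // Poll the unread elements while the stream remains open
    while (!ods.isClosed()) {
        List<T> newObjs = ods.poll();
        process(newObjs);
    }
    // Final poll using the optional timeout (in milliseconds)
    process(ods.poll(1000));
}

Listing 4: Object streams example in Java (sketch).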
Listing 4 shows an example using object streams in Java. Notice that the stream creation forces all the stream objects to be of the same type T. Internally, the stream serialises and deserialises the objects so that the application can publish and poll elements of type T directly to/from the stream. As previously explained, the example also shows the usage of the publish method for a single element or a list of elements, the poll method with and without the optional timeout parameter, and the common API calls to close the stream and check its status. Due to space constraints, the example only shows the ODS usage in Java, but our prototype provides an equivalent implementation in Python.
The FileDistroStream implementation (FDS) implements the generic DistroStream interface to support the streaming of files. Like ODS, its design allows using different backends; however, at its current state, it uses a custom implementation that monitors the creation of files inside a given directory. The Directory Monitor backend sends the file locations through the stream and relies on a distributed file system to share the file content. Thus, the monitored directory must be available to every client on the same path.
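The sketch below illustrates this behaviour, assuming a constructor that receives the monitored base directory and the same close()/isClosed()/poll() calls used for object streams; the process() helper is hypothetical.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// PRODUCER
void produce(String baseDir, List<String> contents) throws Exception {
    // Create the stream: baseDir is the directory monitored for new files
    FileDistroStream fds = new FileDistroStream(baseDir);
    // Files are not explicitly published: writing them into the monitored
    // directory automatically publishes their locations
    int i = 0;
    for (String content : contents) {
        Path file = Paths.get(baseDir, "file" + i++);
        Files.write(file, content.getBytes());
    }
    // Close the stream once all the files have been written
    fds.close();
}

// CONSUMER
void consume(FileDistroStream fds) throws Exception {
    // Poll the paths of the newly available files while the stream is open
    while (!fds.isClosed()) {
        List<String> newFiles = fds.poll();
        for (String filePath : newFiles) {
            process(filePath);  // hypothetical application-specific processing
        }
    }
}

Listing 5: File streams example in Java (sketch).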
Listing 5 shows an example using file streams in Java. Notice that the FDS instantiation requires a base directory to monitor the creation of files, and that it optionally accepts an alias argument to retrieve the content of an already existing stream. Also, files are not explicitly published on the stream since the base directory is automatically monitored; instead, regular methods to write files are used. However, the consumer must explicitly call the poll method to retrieve the list of newly available file paths in the stream. As with ODS, applications can also use the common API calls to close the stream, check its status, and retrieve metadata information. Due to space constraints, the example only shows the FDS usage in Java, but our prototype provides an equivalent implementation in Python.
The Distributed Stream Library (DistroStreamLib) handles the stream objects and provides three major components. First, the DistroStream API and implementations described in the previous sections.

Second, the library provides the DistroStream Client that must be available for each application process. The client is used to forward any stream metadata request to the DistroStream Server and any stream data access to the suitable stream backend (i.e., Directory Monitor or Kafka). To avoid repeated queries to the server, the client stores the retrieved metadata in a cache-like fashion. Either the Server or the backend can invalidate the cached values.

Third, the library provides the DistroStream Server process that is unique for all the applications sharing the stream set. The server maintains a registry of active streams, consumers, and producers with the purpose of coordinating any stream data or metadata access. Among other responsibilities, it is in charge of assigning unique ids to new streams, checking the access permissions of producers and consumers when requesting publish and poll operations, and notifying all registered consumers when the stream has been completely closed and there are no producers remaining.

Figure 6 contains a sequence diagram that illustrates the interaction of the different Distributed Stream Library components when serving a user petition. The DistroStream implementation used by the applications always forwards the requests to the DistroStream Client available on the process. The client communicates with the DistroStream Server for control purposes, and retrieves the real data from the backend.

Figure 6: Sequence diagram of the Distributed Stream Library components.
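The following pseudo-Java sketch summarises this flow; every class and method name here is illustrative rather than the library's actual API.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative client: metadata requests go to the DistroStream Server
// (with caching), while data accesses go to the stream backend
class DistroStreamClientSketch {

    interface ServerConnection {
        StreamMetadata requestMetadata(String streamId);
        void checkPollPermission(String streamId);
    }

    interface StreamBackend {
        List<byte[]> fetchNewElements(String streamId);
    }

    static class StreamMetadata { /* id, alias, stream type, ... */ }

    private final ServerConnection server;  // socket to the DistroStream Server
    private final StreamBackend backend;    // e.g., Kafka or Directory Monitor
    private final Map<String, StreamMetadata> cache = new HashMap<>();

    DistroStreamClientSketch(ServerConnection server, StreamBackend backend) {
        this.server = server;
        this.backend = backend;
    }

    StreamMetadata getMetadata(String streamId) {
        // Serve from the local cache when possible; the server or the
        // backend may invalidate cached entries at any time
        return cache.computeIfAbsent(streamId, server::requestMetadata);
    }

    List<byte[]> poll(String streamId) {
        // Control goes through the server (e.g., access permissions)...
        server.checkPollPermission(streamId);
        // ...while the real data is retrieved from the backend
        return backend.fetchNewElements(streamId);
    }
}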
As already mentioned in Section 3.1, the prototype to evaluate Hybrid Workflows is based on the COMPSs workflow manager. At the programming-model level, we have extended the COMPSs Parameter Annotation to include a new STREAM type.

As shown in Listing 6, on the one hand, the users declare producer tasks (methods that write data into a stream) by adding a parameter of type STREAM and direction OUT (lines 3 to 7 in the listing). On the other hand, the users declare consumer tasks (methods that read data from a stream) by adding a parameter of type STREAM and direction IN (lines 9 to 13 in the listing). In the current design, we have not considered INOUT streams because we do not envision a use case where the same method writes data into its own stream. However, the design can be easily extended to support such behaviour when required.

public interface Itf {

    @Method(declaringClass = "Producer")
    Integer sendMessages(
        @Parameter(type = Type.STREAM, direction = Direction.OUT)
        DistroStream stream
    );

    @Method(declaringClass = "Consumer")
    Result receiveMessages(
        @Parameter(type = Type.STREAM, direction = Direction.IN)
        DistroStream stream
    );
}

Listing 6: Stream parameter annotation example in Java.
Furthermore, we want to highlight that this new annotation integrates smoothly with any other previous annotation. For instance, Listing 7 shows a single producer task that uses two parameters: a stream parameter typical of dataflows (lines 5 and 6) and a file parameter typical of task-based workflows (lines 7 and 8).

public interface Itf {

    @Method(declaringClass = "Producer")
    Integer sendMessages(
        @Parameter(type = Type.STREAM, direction = Direction.OUT)
        DistroStream stream,
        @Parameter(type = Type.FILE, direction = Direction.IN)
        String file
    );
}

Listing 7: Example combining stream and file parameters in Java.
As depicted in Figure 7, COMPSs registers the different tasks from the application's main code through the Task Analyser component. Then, it builds a task graph based on the data dependencies and submits it to the Task Dispatcher. The Task Dispatcher interacts with the Task Scheduler to schedule the dependency-free tasks when possible and, eventually, submit them to execution. The execution step includes the job creation, the transfer of the input data, the job transfer to the selected resource, the real task execution on the worker, and the output retrieval from the worker back to the master. If any of these steps fail, COMPSs provides fault-tolerance mechanisms for partial failures. Also, once the task has finished, COMPSs stores the monitoring data of the task, synchronises any data required by the application, releases the data-dependent tasks so that they can be scheduled, and deletes the task.

Figure 7: Structure of the internal COMPSs components.

Therefore, the new STREAM annotation has forced modifications in the Task Analyser and Task Scheduler components. More specifically, notice that a stream parameter does not define a traditional data dependency between a producer and a consumer task since both tasks can run at the same time. However, some information must be stored so that the Task Scheduler can correctly handle the available resources and the data locality. In this sense, when using the same stream object, our prototype prioritises producer tasks over consumer tasks to avoid wasting resources when a consumer task is waiting for data to be produced by a non-running producer task. Moreover, the Task Scheduler assumes that the resources that are running (or have run) producer tasks are the data locations for the stream. This information is used to schedule the consumer tasks accordingly and minimise as much as possible the data transfers between nodes.
Figure 8: COMPSs and Distributed Stream Library deployment.

Regarding the components' deployment, as shown in Figure 8, the COMPSs master spawns the DistroStream Server and the required backend. Furthermore, it includes a DistroStream Client to handle the stream accesses and requests performed on the application's main code. On the other hand, the COMPSs workers only spawn a DistroStream Client to handle the stream accesses and requests performed on the tasks. Notice that the COMPSs master-worker communication is done through NIO [36], while the DistroStream Server-Client communication is done through sockets.
5. Use Cases
Enabling hybrid task-based workflows and dataflows in a single programming model allows users to define new types of complex workflows. We introduce four patterns that appear in real-world applications so that users identifying these patterns in their workflows can benefit from the new capabilities and performance improvements of Hybrid Workflows. The next subsections provide an in-depth analysis of each use case.
5.1. Processing data continuously as it is generated

One of the main drawbacks of task-based workflows is waiting for task completion to process its results. Often, the output data is generated continuously during the task execution rather than at the end. Hence, enabling data streams allows users to process the data as it is generated.

@constraint(computing_units=CORES_SIMULATION)
@task(varargs_type=FILE_OUT)
def simulation(num_files, *args):
    ...

@constraint(computing_units=CORES_PROCESS)
@task(input_file=FILE_IN, output_image=FILE_OUT)
def process_sim_file(input_file, output_image):
    ...

@constraint(computing_units=CORES_MERGE)
@task(output_gif=FILE_OUT, varargs_type=FILE_IN)
def merge_reduce(output_gif, *args):
    ...


def main():
    num_sims, num_files, sim_files, output_images, output_gifs = ...

    for i in range(num_sims):
        simulation(num_files, *sim_files[i])

    for i in range(num_sims):
        for j in range(num_files):
            process_sim_file(sim_files[i][j], output_images[i][j])

    for i in range(num_sims):
        merge_reduce(output_gifs[i], *output_images[i])

    for i in range(num_sims):
        output_gifs[i] = compss_wait_on_file(output_gifs[i])

Listing 8: Simulations' application in Python without streams.
For instance, Listing 8 shows the code of a pure task-based application that launches num_sims simulations (line 21). Each simulation produces output files at different time steps (i.e., an output file every iteration of the simulation). The results of these simulations are processed separately by the process_sim_file task (line 25) and merged into a single GIF per simulation (line 28). The example code also includes the task definitions (lines 1 to 13) and the synchronisation API calls to retrieve the results (line 31).

Figure 9 shows the task graph generated by the previous code when running with 2 simulations (num_sims) and 5 files per simulation (num_files). The simulation tasks are shown in blue, the process_sim_file in white and red, and the merge_reduce in pink. Notice that the simulations and the processing of the files cannot run in parallel since the task-based workflow forces the completion of the simulation tasks before beginning any of the processing tasks.
Figure 9: Task graph of the simulation application without streaming.
On the other hand, Listing 9 shows the code of the same application using streams to retrieve the data from the simulations as it is generated and forward it to its processing tasks. The application initialises the streams (lines 20 to 22), launches num_sims simulations (line 25), spawns a process task for each received element in each stream (line 34), merges all the output files into a single GIF per simulation (line 37), and synchronises the final results. The process_sim_file and merge_reduce task definitions are identical to the previous example. Conversely, the simulation task definition uses the STREAM_OUT annotation to indicate that one of the parameters is a stream where the task is going to publish data. Also, although the simulation, merge, and synchronisation phases are very similar to the pure task-based workflow, the processing phase is completely different (lines 27 to 34). When using streams, the main code needs to check the stream status, retrieve its published elements, and spawn a process_sim_file task per element. However, the complexity of the code does not increase significantly when adding streams to an existing application.

@constraint(computing_units=CORES_SIMULATION)
@task(fds=STREAM_OUT)
def simulation(fds, num_files):
    ...

@constraint(computing_units=CORES_PROCESS)
@task(input_file=FILE_IN, output_image=FILE_OUT)
def process_sim_file(input_file, output_image):
    ...

@constraint(computing_units=CORES_MERGE)
@task(output_gif=FILE_OUT, varargs_type=FILE_IN)
def merge_reduce(output_gif, *args):
    ...


def main():
    num_sims, num_files, output_images, output_gifs = ...

    input_streams = [None for _ in range(num_sims)]
    for i in range(num_sims):
        input_streams[i] = FileDistroStream(base_dir=stream_dir)

    for i in range(num_sims):
        simulation(input_streams[i], num_files)

    for i in range(num_sims):
        while not input_streams[i].is_closed():
            new_files = input_streams[i].poll()

            for input_file in new_files:
                output_image = input_file + ".out"
                output_images[i].append(output_image)
                process_sim_file(input_file, output_image)

    for i in range(num_sims):
        merge_reduce(output_gifs[i], *output_images[i])

    for i in range(num_sims):
        output_gifs[i] = compss_wait_on_file(output_gifs[i])

Listing 9: Simulations' application in Python with streams.

Figure 10 shows the task graph generated by the previous code when running with the same parameters as the pure task-based example (2 simulations and 5 files per simulation). The colour code is also the same as in the previous example: the simulation tasks are shown in blue, the process_sim_file in white and red, and the merge_reduce in pink. Notice that streams enable the execution of the processing tasks while the simulations are still running; potentially reducing the total execution time and increasing the resources utilisation (see Section 6.2 for further details).

Figure 10: Task graph of the simulation application with streaming.
5.2. Removing synchronisation points

Streams can also be used to communicate data between tasks without waiting for the tasks' completion. This technique can be useful when performing parameter sweep, cross-validation, or running the same algorithm with different initial points.
Figure 11: Task graph of the multi-simulations application.
For instance, Figure 11 shows three algorithms running simultaneously that exchange control data at the end of every iteration. Notice that the data exchange at the end of each iteration can be done synchronously by stopping all the simulations, or asynchronously by sending the updated results and processing the pending messages in the stream (even though some messages of the current iteration might be received in the next iteration). Furthermore, each algorithm can run a complete task-based workflow to perform the iteration calculus, obtaining a nested task-based workflow inside a pure dataflow.
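A sketch of the asynchronous variant of this exchange, using the DistroStream API from Section 4, might look as follows; the State type, its merge logic, and the computeIteration() step are placeholders.

import java.util.List;

// Each algorithm runs as a single task, publishing its state at the end
// of every iteration and merging whatever states have already arrived
// from the others (late messages are simply picked up next iteration)
class IterativeComputation {

    static class State {
        static State initial() { return new State(); }
        State merge(List<State> others) { return this; }
    }

    void run(ObjectDistroStream<State> exchange, int iterations) throws Exception {
        State state = State.initial();
        for (int it = 0; it < iterations; it++) {
            state = computeIteration(state);
            // Asynchronous exchange: publish our state...
            exchange.publish(state);
            // ...and merge the currently available ones without blocking
            // (a real implementation would tag and skip its own messages)
            state = state.merge(exchange.poll());
        }
    }

    State computeIteration(State s) { /* iteration calculus */ return s; }
}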
5.3. Processing external real-time data

Many applications receive their data continuously from external streams (e.g., IoT sensors) that are not part of the application itself. Moreover, depending on the workload, the stream data can be produced by a single task and consumed by many tasks (one to many), produced by many tasks and consumed by a single task (many to one), or produced by many tasks and consumed by many tasks (many to many). The Distributed Stream Library supports all three scenarios transparently, and allows configuring the consumer mode to process the data at least once, at most once, or exactly once when using many consumers.
Figure 12: Task graph of the sensor application.
Figure 12 shows an external sensor (Stream 1 in the figure) producing data that is filtered simultaneously by 4 tasks (coloured in white). The relevant data is then extracted from an internal stream (Stream 2) by an intermediate task (task 6, coloured in red) and used to run a task-based algorithm. The result is a hybrid task-based workflow and dataflow. Also, the sensor uses a one-to-many stream configured to process the data exactly once, and the filter tasks (coloured in white) use a many-to-one stream to publish data to the extract task (coloured in red).
5.4. Combining task-based workflows and dataflows at different levels

Our proposal also allows combining task-based workflows and dataflows at different levels; having nested task-based workflows inside a dataflow task or vice-versa. This feature enables the internal parallelisation of tasks, allowing workflows to scale resources up and down depending on the workload.
Figure 13: Task graph of the hybrid nested application.
For instance, Figure 13 shows a dataflow with two nested task-based workflows. The application is similar to the previous use case: task 1 (coloured in pink) produces the data, task 2 (in white) filters it, task 3 (in blue) extracts and collects the data, and task 4 (in red) runs a big computation.

Notice that, in the previous use case, the application always has 4 filter tasks. However, in this scenario, the filter task has a nested task-based workflow that accumulates the received data into batches and spawns a new filter task per batch. This technique dynamically adapts the resource usage to the amount of data received by the input stream. Likewise, the big computation task also contains a nested task-based workflow. This shows that users can internally parallelise some computations without modifying the original dataflow.
6. Evaluation
This section evaluates the performance of the new features enabled by our prototype when using data streams against their equivalent implementations using task-based workflows. Furthermore, we analyse the scalability and load balancing of the stream writer and reader processes. Finally, we provide an in-depth analysis of the COMPSs runtime performance by comparing the task analysis, task scheduling, and task execution times when using pure task-based workflows or streams.
6.1. Experimental setup

The results presented in this section have been obtained using the MareNostrum 4 supercomputer [37] located at the Barcelona Supercomputing Center (BSC). Its current peak performance is 11.15 Petaflops. The supercomputer is composed of 3456 nodes, each of them with two Intel® Xeon Platinum 8160 (24 cores at 2.1 GHz each). It has 384.75 TB of main memory, 100 Gb Intel® Omni-Path Full-Fat Tree interconnection, and 14 PB of shared disk storage managed by the Global Parallel File System.

Regarding the software, we have used the DistroStream Library (available at [38]), COMPSs version 2.5.rc1909 (available at [39]), and Kafka version 2.3.0 (available at [40]). We have also used Java OpenJDK 8u131, Python 2.7.13, GCC 7.2.0, and Boost 1.64.0.

6.2. Gain of processing data continuously
As explained in the first use case in Section 5.1, one of the significant advantages of using data streams is processing data continuously as it is generated. For that purpose, Figure 14 compares the Paraver [32] traces of the original COMPSs execution (pure task-based workflow) and the execution using Hybrid Workflows. Each trace shows the available threads on the vertical axis and the execution time on the horizontal axis (36 s in both traces). Also, each colour represents the execution of a task type, corresponding to the colours shown in the task graphs of the first use case (see Section 5.1). The green flags indicate when a simulation has generated all its output files and has closed its associated writing stream. Both implementations are written in Python, and the Directory Monitor is set as the stream backend.

Figure 14: Paraver traces to illustrate the gain of processing data continuously: (a) original COMPSs execution; (b) execution with Hybrid Workflows.
In contrast to the original COMPSs execution (top), the execution using Hybrid Workflows (bottom) runs the processing tasks (white and red) while the simulations (blue) are still running; significantly reducing the total execution time and increasing the resources utilisation. Moreover, the merge_reduce tasks (pink) are able to begin their execution even before the simulation tasks have finished, since all the streams have been closed and the process_sim_file tasks have already finished.

In general terms, the gain of the implementation using Hybrid Workflows with respect to the original COMPSs implementation (calculated following Equation 1) is proportional to the number of tasks that can be executed in parallel while the simulation is active. Therefore, we perform an in-depth analysis of the trade-off between the generation and process times. It is worth mentioning that we define the generation time as the time elapsed between the generation of two elements of the simulation. Hence, the total duration of the simulation is the generation time multiplied by the number of generated elements. Also, the process time is defined as the time to process a single element (that is, the duration of the process_sim_file task).
\[ Gain = \frac{ExecutionTime_{original} - ExecutionTime_{hybrid}}{ExecutionTime_{original}} \tag{1} \]

The experiment uses 2 nodes of 48 cores each. Since the COMPSs master reserves 12 cores, there are two available workers with 36 and 48 cores, respectively. The simulation is configured to use 48 cores, leaving 36 available cores while it is active and 84 available cores when it is over. Also, the process tasks are configured to use one single core.
Figure 15: Average execution time and gain of a simulation with increasing generation time.
Figure 15 depicts the average execution time of 5 runs where each simulation generates 500 elements. The process time is fixed to 60,000 ms, while the generation time between stream elements varies from 100 ms to 2,000 ms. For short generation times, almost all the processing tasks are executed when the generation task has already finished, obtaining no gain with respect to the implementation with objects. For instance, when generating elements every 100 ms, the simulation takes 50,000 ms in total (500 elements · 100 ms/element). Since the process tasks last 60,000 ms, none of them will have finished before the simulation ends; leading to almost no gain.

When increasing the generation time, more and more tasks can be executed while the generation is still active; achieving a 19% gain when generating stream elements every 500 ms. However, the gain is limited because the last generated elements are always processed when the simulation is over. Therefore, increasing the generation time from 500 ms to 2,000 ms only raises the gain from 19% to 23%.

Figure 16: Average execution time and gain of a simulation with increasing process time.
On the other hand, Figure 16 illustrates the average execution time of 5 runs that generate 500 process tasks with a fixed generation time of 100 ms and a process time varying from 5,000 ms up to 60,000 ms. Notice that the total simulation time is 50,000 ms. When the processing time is short, many tasks can be executed while the generation is still active; achieving a maximum 23% gain when the processing time is 5,000 ms.

As in the previous case, when the processing time increases, the number of tasks that can be executed while the generation is active decreases and, thus, so does the gain. Also, the gain is almost zero when the processing time is big enough (60,000 ms) that none of the process tasks will have finished before the generation ends.
6.3. Gain of removing synchronisation points

Many workflows are composed of several iterative computations running simultaneously until a certain convergence criterion is met. As described in Section 5.2, this technique is useful when performing parameter sweep, cross-validation, or running the same algorithm with different initial points.
Figure 17: Parallel iterative computations. Pure task-based workflow and Hybrid Workflow shown at left and right, respectively.
To this end, each computation requires a phase at the end of each iteration to exchange information with the rest. When using pure task-based workflows, this phase requires stopping all the computations at the end of each iteration, retrieving all the states, creating a task to exchange and update all the states, transferring back all the new states, and resuming all the computations for the next iteration. The left task graph of Figure 17 shows an example of such workflows with two iterations of two parallel computations. The first two red tasks initialise the state of each computation, the pink tasks perform the computation of each iteration, and the blue tasks retrieve and update the state of each computation.

Conversely, when using Hybrid Workflows, each computation can exchange the information at the end of each iteration asynchronously by writing and reading the states to/from streams. This technique avoids splitting each computation into tasks, stopping and resuming each computation at every iteration, and synchronising all the computations to exchange data. The right task graph of Figure 17 depicts the equivalent Hybrid Workflow of the previous example. Each computation is run in a single task (white) that performs the state initialisation, all the iterations, and all the update phases at the end of each iteration.
Figure 18: Average execution time and gain of a simulation with an increasing number of iterations.
Using the previous examples, Figure 18 evaluates the performance gain of avoiding the synchronisation and exchange process at the end of each iteration (calculated following Equation 2). The benchmark executes the pure task-based workflow (blue) and the Hybrid Workflow (green) versions of the same workflow written in Java and using Kafka as streaming backend. Also, it is composed of two independent computations with a fixed computation per iteration (2,000 ms) and an increasing number of iterations. The results shown are the mean execution times of 5 runs of each configuration.
\[ Gain = \frac{ExecutionTime_{pure\ task\text{-}based} - ExecutionTime_{hybrid}}{ExecutionTime_{pure\ task\text{-}based}} \tag{2} \]

Notice that the total gain is influenced by three factors: the removal of the synchronisation task at the end of each iteration; the cost of transferring the state between the process and the synchronisation tasks; and the division of the state's initialisation and process. Although we have reduced the state of each computation to 24 bytes and used a single worker machine to minimise the impact of the transfer overhead, the second and third factors become important when running a small number of iterations (below 32), reaching a maximum gain of 42% when running a single iteration. For a larger number of iterations (over 32), the removal of the synchronisation becomes the main factor, and the total gain reaches a steady state around 33%.

6.4. Scalability of writers and readers

Our prototype supports N-M streams, meaning that any stream can have an arbitrary number of writers and readers. To evaluate the performance and load balance, we have implemented a Java application that uses a single stream and creates N writer tasks and M reader tasks. Although our writer and reader tasks use a single core, we spawn each of them in separate nodes so that the data must be transferred. In more sophisticated use cases, each task could benefit from an intra-node technology (such as OpenMP) to parallelise the processing of the stream data.
Figure 19: Average execution time and efficiency (left and right, respectively) with an increasing number of readers and different numbers of writers.
Figure 19 depicts the average execution time (left) and the efficiency (right) of 5 runs with an increasing number of readers. Each series uses a different number of writers, also going from 1 to 8. The writers publish 100 elements in total, the size of the published objects is 24 bytes, and the time to process an element is set to 1,000 ms. The efficiency is calculated using the ideal execution time as reference; i.e., the number of elements multiplied by the time to process an element and divided by the number of readers.

Since the execution time is mainly due to the processing of the elements in the reader tasks, increasing the number of writers makes no significant difference. However, in all cases, increasing the number of readers significantly impacts the execution time, achieving a 4.84 speed-up with 8 readers. Furthermore, the efficiencies using 1 reader are close to the ideal (87% on average) because the only overheads are the creation of the elements, the task spawning, and the data transfers. However, when increasing the number of readers, the load imbalance significantly affects the efficiency, which drops to around 50% with 8 readers.
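For clarity, the ideal execution time and the efficiency used in Figure 19 can be written as:

\[
T_{ideal} = \frac{N_{elements} \times t_{process}}{N_{readers}},
\qquad
Efficiency = \frac{T_{ideal}}{T_{measured}}
\]

For instance, with 100 elements, 1,000 ms of processing per element, and 8 readers, $T_{ideal} = (100 \times 1{,}000\,ms) / 8 = 12.5\,s$.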
Figure 20: Number of stream elements processed per reader.
It is worth mentioning that the achieved speed-up is lower than the ideal (8) due to load imbalance: the elements processed by each reader task are not balanced, since elements are assigned to the first process that requests them. Figure 20 illustrates an in-depth study of the load imbalance when running 1, 2, 4, or 8 readers. Notice that, when running with 2 readers, the first reader gets almost 75% of the elements while the second one only processes 25% of the total load. The same pattern appears when increasing the number of readers, where half of the tasks perform around 70% of the total load. For instance, when running with 4 readers, 2 tasks perform 69% of the work (34.5% each), while the rest only perform 31% of the total load (15.5% each). Similarly, when running with 8 readers, 4 tasks perform 70% of the total load (17.5% each), while the other four only process 30% (7.5% each).

At its current state, the Distributed Stream Library does not implement any load-balancing technique, nor does it limit the number of elements retrieved by each poll call. As future work, since the library already stores the processes registered to each stream, it could implement some policy and send only a subset of the available elements to the requesting process rather than all of them.
To provide a more in-depth analysis of the performance of our prototype, we have compared each step of the task life-cycle when using ObjectParameter (OP from now on) or StreamParameter (SP from now on), the latter backed by ObjectDistroStreams. The following figures evaluate the task analysis, task scheduling, and task execution average times of 100 tasks using (a) a single object of increasing size (from 1 MB to 128 MB) or (b) an increasing number of objects (from 1 to 16) of fixed size (8 MB). Both implementations are written in Java, and Kafka is used as the stream backend. Regarding the task definition, notice that the OP implementation requires an ObjectParameter for each object sent to the task. In contrast, the SP implementation only requires a single StreamParameter, since all the objects are sent through the stream itself.
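As a reference for how the two task definitions differ, the following COMPSs-style Java interface sketches them side by side. The annotation packages and the Type.STREAM constant follow COMPSs conventions but are assumptions of this sketch, as is the hypothetical Payload type.

    // Hypothetical COMPSs task interface contrasting the two definitions.
    // Imports assumed from the COMPSs annotation packages.
    import es.bsc.compss.types.annotations.Parameter;
    import es.bsc.compss.types.annotations.parameter.Direction;
    import es.bsc.compss.types.annotations.parameter.Type;
    import es.bsc.compss.types.annotations.task.Method;

    public interface BenchmarkItf {

        // OP version: one ObjectParameter per object sent to the task.
        @Method(declaringClass = "Benchmark")
        void processObjects(
            @Parameter(type = Type.OBJECT, direction = Direction.IN) Payload o1,
            @Parameter(type = Type.OBJECT, direction = Direction.IN) Payload o2
            // ...one annotated parameter per additional object
        );

        // SP version: a single StreamParameter; the objects themselves
        // travel through the stream, not through the task signature.
        @Method(declaringClass = "Benchmark")
        void processStream(
            @Parameter(type = Type.STREAM, direction = Direction.IN)
            ObjectDistroStream<Payload> input
        );
    }

Note that adding more objects to the OP version grows the task signature, whereas the SP version keeps a single parameter regardless of how many objects are published.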
Figure 21: Task analysis average time for one single parameter with increasing sizes (left) or an increasing number of parameters (right).
Figure 21 compares the task analysis results. The task analysis time is the time spent by the runtime to register the task and its parameters into the system. It is worth mentioning that increasing the object's size does not affect the analysis time in either the OP or the SP implementation. There is, however, a difference of around 0.05 ms due to the creation of the internal structures that represent object parameters or stream parameters.

On the other hand, increasing the number of objects directly affects the task analysis time because the runtime needs to register each task parameter individually. For the OP implementation, each object maps to an extra task parameter, and thus the task analysis time slightly increases with the number of objects. Conversely, for the SP implementation, the stream parameter itself is not modified, since we only increase the number of published objects. Hence, the task analysis time remains constant when increasing the number of objects.
Figure 22: Task scheduling average time for one single parameter with increasing sizes (left) or an increasing number of parameters (right).
Figure 22 compares the task scheduling results. On the one hand, the scheduling time for both implementations varies from 2.05 ms to 2.20 ms but does not show any clear tendency to increase with the object's size. On the other hand, when increasing the number of objects, the scheduling time increases for the OP implementation and remains constant for the SP implementation. This behaviour is due to the fact that the default COMPSs scheduler implements data locality and, thus, the scheduling time is proportional to the number of parameters. As in the previous case, increasing the number of objects increases the number of task parameters for the OP implementation (increasing its scheduling time) but keeps a single parameter for the SP implementation (maintaining its scheduling time).

Figure 23: Task execution average time for one single parameter with increasing sizes (left) or an increasing number of parameters (right).
Figure 23 compares the task execution results. The task execution time covers the transfer of all the task parameters and the task execution itself. Regarding the SP implementation, the time remains constant at around 208 ms regardless of the object's size and the number of objects, because the measurement only considers the transfer of the stream object itself and the execution time of the poll method. It is worth mentioning that the actual transfers of the objects are done by Kafka when invoking the publish method in the main code, and thus they are executed in parallel while COMPSs spawns the task on the worker machine.

Conversely, the execution time for the OP implementation increases with both the object's size and the number of objects, since the serialisation and transfer times also increase. However, the task execution does not need to fetch the objects (the poll method), since all of them have already been transferred. This trade-off can be observed in the figure, where the OP implementation performs better than the SP implementation when using task parameters smaller than 48 MB and performs worse for bigger cases. Notice that only the total objects' size is relevant, since the same behaviour is shown when using a single 48 MB object or 6 objects of 8 MB each.

Since the real object transfers when using SP are executed during the publish method and cannot be observed by measuring the task execution time, we have also measured the total execution time of the benchmark for both implementations. Figure 24 shows the total execution time with an increasing number of objects of 8 MB. In contrast to the previous plot, both implementations show an execution time that grows proportionally to the objects' total size. Also, the SP implementation only outperforms the OP implementation when using more than 12 objects.
Figure 24: Total execution time with increasing number of parameters.
To conclude, since there are no major differences regarding the task analysis time nor the task scheduling time, we can safely assume that the use of streams instead of regular objects is recommended when the total size of the task parameters exceeds 48 MB and there are more than 12 objects published to the stream.
7. Conclusion and Future Work
This paper demonstrates that task-based workflows and dataflows can be integrated into a single programming model to better cover the needs of the new Data Science workflows. Using Hybrid Workflows, developers can build complex pipelines with different approaches at many levels using a single framework.

The proposed solution relies on the DistroStream concept: a generic API used by applications to handle stream accesses homogeneously regardless of the software backing it. Two implementations provide the specific logic to support object and file streams. The first one, ObjectDistroStream, is built on top of Kafka to enable object streams. The second one, FileDistroStream, monitors the creation of files inside a directory, sends the file locations through the stream, and relies on a distributed file system to share the file content.

The DistroStream API and both implementations are part of the DistroStreamLib, which also provides the DistroStream Client and the DistroStream Server. While the client acts as a broker on behalf of the application and interacts with the corresponding backend, the server manages the streams' metadata and coordinates the accesses.

By integrating the DistroStreamLib into a task-based Workflow Manager, its programming model can easily support Hybrid Workflows. The described prototype extends COMPSs to enable tasks with continuous input and output data by providing a new annotation for stream parameters. Implementing the handling of such stream-type values led to some modifications of the Task Analyser and Task Scheduler components of the runtime. Using the DistroStreamLib also implied changes at deployment time, since its components need to be spawned along with COMPSs. On the one hand, the COMPSs master hosts the DistroStream Server, the required stream backend, and a DistroStream Client to handle stream accesses on the application's main code. On the other hand, each COMPSs worker contains a DistroStream Client that performs the stream accesses on tasks. Although the described prototype only builds on COMPSs, it can be used as an implementation reference for any other existing task-based framework.

This paper also presents four use cases illustrating the new capabilities that users may identify in their workflows to benefit from the use of Hybrid Workflows. On the one hand, streams can be internal or external to the application and can be used to communicate continuous data or control data. On the other hand, streams can be accessed inside the main code, native tasks (i.e., Java or Python), or non-native tasks (i.e., MPI, binaries, and nested COMPSs workflows). Furthermore, the Distributed Stream Library supports the one-to-many, many-to-one, and many-to-many scenarios transparently, and allows configuring the consumer mode to process the data at least once, at most once, or exactly once when using many consumers.

The evaluation demonstrates the benefit of processing data continuously as it is generated, achieving a 23% gain with the right generation and process times and resources. Also, using streams as a control mechanism enabled the removal of synchronisation points when running several parallel algorithms, leading to a 33% gain when running more than 32 iterations. Finally, an in-depth analysis of the runtime's performance shows that there are no major differences regarding the task analysis time nor the task scheduling time when using streams or object tasks, and that the use of streams is recommended when the total size of the task parameters exceeds 48 MB and more than 12 objects are published to the stream.

Although the solution is fully functional, some improvements can be made. Regarding the DistroStream implementations, we plan to extend the FileDistroStream to support shared disks with different mount points. On the other hand, we will add new ObjectDistroStream backend implementations (apart from Kafka), so that users can choose between them without changing the application's code. Envisaging that a single DistroStream Server could become a bottleneck when managing several applications involving a large number of cores, we consider replacing the client-server architecture with a peer-to-peer approach. Finally, by highlighting the benefits of Hybrid Workflows, we expect to attract real-world applications to elevate our evaluation to more complex use cases.
Acknowledgements
This work has been supported by the Spanish Government (contracts SEV-2015-0493 and TIN2015-65316-P), by the Generalitat de Catalunya (contract 2014-SGR-1051), and by the European Commission through the Horizon 2020 Research and Innovation programme under contract 730929 (MF2C project). Cristian Ramon-Cortes' predoctoral contract is financed by the Spanish Government under contract BES-2016-076791.