A Programming Model for Hybrid Workflows: combining Task-based Workflows and Dataflows all-in-one
Cristian Ramon-Cortes, Francesc Lordan, Jorge Ejarque, Rosa M. Badia
Barcelona Supercomputing Center (BSC)

This manuscript has been accepted at Future Generation Computer Systems (FGCS). DOI: 10.1016/j.future.2020.07.007. This manuscript is licensed under CC-BY-NC-ND.
Abstract
In the past years, e-Science applications have evolved from large-scale simulations executed in a single cluster to more complex workflows where these simulations are combined with High-Performance Data Analytics (HPDA). To implement these workflows, developers are currently using different patterns; mainly task-based and dataflow. However, since these patterns are usually managed by separated frameworks, the implementation of these applications requires combining them, considerably increasing the effort for learning, deploying, and integrating applications in the different frameworks.

This paper tries to reduce this effort by proposing a way to extend task-based management systems to support continuous input and output data, enabling the combination of task-based workflows and dataflows (Hybrid Workflows from now on) using a single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements. To illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library and a fully functional prototype extending COMPSs, a mature, general-purpose, task-based, parallel programming model. The library can be easily integrated with existing task-based frameworks to provide support for dataflows. Also, it provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python, enabling complex workflows to handle any data type without dealing directly with the streaming back-end.
During the evaluation, we introduce four use cases to illustrate the new capabilities of Hybrid Workflows; measuring the performance benefits when processing data continuously as it is generated, when removing synchronisation points, when processing external real-time data, and when combining task-based workflows and dataflows at different levels. Users identifying these patterns in their workflows may use the presented use cases (and their performance improvements) as a reference to update their code and benefit from the capabilities of Hybrid Workflows. Furthermore, we analyse the scalability in terms of the number of writers and readers, and measure the task analysis, task scheduling, and task execution times when using objects or streams.
1. Introduction
For many years, large-scale simulations, High-Performance Data Analytics (HPDA), and simulation workflows have become a must to progress in many scientific areas such as life, health, and earth sciences. In such a context, there is a need to adapt the High-Performance infrastructure and frameworks to support the needs and challenges of workflows combining these technologies [1].

Traditionally, developers have tackled the parallelisation and distributed execution of these applications following two different strategies. On the one hand, task-based workflows orchestrate the execution of several pieces of code (tasks) that process and generate data values. These tasks have no state and, during their execution, they are isolated from other tasks; thus, task-based workflows consist of defining the data dependencies among tasks. On the other hand, dataflows assume that tasks are persistent executions with a state that continuously receive/produce data values (streams). Through dataflows, developers describe how the tasks communicate with each other.

Regardless of the workflow type, directed graphs are a useful visualisation and management tool. Figure 1 shows the graph representation of a task-based workflow (left) and its equivalent dataflow (right). The task dependency graph consists of a producer task (coloured in pink) and five consumer tasks (coloured in blue) that can run in parallel after the producer completes. The dataflow graph also has a producer task (coloured in pink), but one single stateful consumer task (coloured in blue) which processes all the input data sequentially (unless the developer internally parallelises it). Rather than waiting for the completion of the producer task to process all its outputs, the consumer task can process the data as it is generated.
Figure 1: Representation of the same workflow using task-based and dataflow patterns (left and right, respectively).
The first contribution of this paper is the proposal of a single hybrid programming model capable of executing task-based workflows and dataflows simultaneously. For that purpose, we extend task-based frameworks to support continuous input and output data, and enable the combination of task-based workflows and dataflows (Hybrid Workflows from now on) using the same programming model. Moreover, it allows developers to build complex Data Science pipelines with different approaches depending on the requirements. The evaluation (Section 6) demonstrates that the use of Hybrid Workflows has significant performance benefits when identifying some patterns in task-based workflows; e.g., when processing data continuously as it is generated, when removing synchronisation points, when processing external real-time data, and when combining task-based workflows and dataflows at different levels. Also, notice that using a single programming model frees the developers from the burden of deploying, using, and communicating different frameworks inside the same workflow.

The second contribution presented in this paper is a Distributed Stream Library that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of a stream, enabling complex workflows to handle any kind of data without dealing directly with the streaming back-end. At its current state, the library supports file streams through a custom implementation, and object streams through Kafka [2].

To validate and evaluate our proposal, we have built a prototype that helps us to illustrate the additional capabilities of Hybrid Workflows for complex Data Science applications. This prototype extends COMPSs [3, 4], a mature, general-purpose programming model, and integrates the Distributed Stream Library.

The rest of the paper is organised as follows. Section 2 presents an overview of the related work, and Section 3 introduces the baseline technology on which our solution is built. Next, Section 4 describes the architecture of the Distributed Stream Library and its integration with COMPSs. Section 5 details four use cases to illustrate the new available features, and Section 6 evaluates our proposal, measuring performance improvements at application level and the additional runtime overhead. Finally, Section 7 concludes the paper and gives some guidelines for future work.
2. Related Work
Nowadays, state-of-the-art frameworks typically focus on the execution of either task-based workflows or dataflows. Thus, the next subsections provide a general overview of the most relevant frameworks for both task-based workflows and dataflows. Furthermore, since our prototype combines both approaches into a single programming model and allows developers to build Hybrid Workflows without deploying and managing two different frameworks, the last subsection details other solutions and compares them with our proposal.
Although all the frameworks handle the tasks and data transfers transparently, there are two main approaches to define task-based workflows. On the one hand, many frameworks force developers to explicitly define the application workflow through a recipe file or a graphical interface. FireWorks [5, 6] defines complex workflows using recipe files in Python, JSON, or YAML. It focuses on high-throughput applications, such as computational chemistry and materials science calculations, and provides support for arbitrary computing resources (including queue systems), monitoring through a built-in web interface, failure detection, and dynamic workflow management. Taverna [7, 8] is a suite of tools to design, monitor, and execute scientific workflows. It provides a graphical user interface for the composition of workflows that are written in the Simple Conceptual Unified Flow Language (Scufl) and executed remotely by the Taverna Server on any underlying infrastructure (such as supercomputers, Grids, or cloud environments). Similarly, Kepler [9, 10] also provides a graphical user interface to compose workflows by selecting and connecting analytical components and data sources. Furthermore, workflows can be easily stored, reused, and shared across the community. Internally, Kepler's architecture is actor-oriented to allow different execution models into the same workflow. Also, Galaxy [11, 12] is a web-based platform for data analysis focused on accessibility and reproducibility of workflows across the scientific community. The users define workflows through the web portal and submit their executions to a Galaxy server containing a full repertoire of tools and reference data. In an attempt to increase the interoperability between the different systems and to avoid the duplication of development efforts, Tavaxy [13] integrates Taverna and Galaxy workflows in a single environment; defining an extensible set of re-usable workflow patterns and supporting cloud capabilities. Although Tavaxy allows the composition of workflows using Taverna and Galaxy sub-workflows, the resulting workflow supports neither streams nor any dataflow pattern.

On the other hand, other frameworks implicitly build the task dependency graph from the user code. Some opt for defining a new scripting language to manage the workflow. These solutions force the users to learn a new language but make a clear differentiation between the workflow's management (the script) and the processes or programs to be executed. Nextflow [14, 15] enables scalable and reproducible workflows using software containers. It provides a fluent DSL to implement and deploy workflows but allows the adaptation of pipelines written in the most common scripting languages. Swift [16, 17] is a parallel scripting language developed in Java and designed to express and coordinate parallel invocations of application programs on distributed and parallel computing platforms. Users only define the main application and the input and output parameters of each program, so that Swift can execute the application in any distributed infrastructure by automatically building the data dependencies.

Other frameworks opt for defining some annotations on top of an already existing language. These solutions save the users from learning a new language but merge the workflow annotations and its execution in the same files. Parsl [18] evolves from Swift and provides an intuitive way to build implicit workflows by annotating "apps" in Python codes.
In Parsl, the developers annotate Python functions (apps) and Parsl constructs a dynamic, parallel execution graph derived from the implicit linkage between apps based on shared input/output data objects. Parsl then executes apps when their dependencies are met. Parsl is resource-independent, that is, the same Parsl script can be executed on a laptop, cluster, cloud, or supercomputer. Dask [19] is a library for parallel computing in Python. Dask follows a task-based approach, being able to take into account the data dependencies between the tasks and exploiting the inherent concurrency. Dask has been designed for computation and interactive data science and integration with Jupyter notebooks. It is built on the dataframe data structure that offers interfaces to NumPy, Pandas, and Python iterators. Dask supports implicit, simple task graphs previously defined by the system (Dask Array or Dask Bag) and, for more complex graphs, the programmer can rely on the delayed annotation that supports the asynchronous execution of tasks by building the corresponding task graph. COMPSs [3, 20, 4] is a task-based programming model for the development of workflows/applications to be executed in distributed programming platforms. The task-dependency graph (or workflow) is generated at execution time and depends on the input data and the dynamic execution of the application. Thus, compared with other workflow systems that are based on the static drawing of the workflow, COMPSs offers a tool for building dynamic workflows, with all the flexibility and expressivity of the programming language.
Stream processing has become an increasingly prevalent solution to process data from social media and sensor devices. On the one hand, many frameworks have been created explicitly to face this problem. Apache Flink [21] is a streaming dataflow engine to perform stateful computations over data streams (i.e., event-driven applications, streaming pipelines, or stream analytics). It provides exactly-once processing, high throughput, automated memory management, and advanced streaming capabilities (such as windowing). Flink users build dataflows that start with one or more input streams (sources), perform arbitrary transformations, and end in one or more outputs (sinks). Apache Samza [22] allows building stateful applications for event processing or real-time analytics. Its differential point is to offer built-in support to process and transform data from many sources, including Apache Kafka, AWS Kinesis, Azure EventHubs, ElasticSearch, and HDFS. Samza users define a stream application that processes messages from a set of input streams, transforms them by chaining multiple operators, and emits the results to output streams or stores. Also, Samza supports at-least-once processing, guaranteeing no data loss even in case of failures. Apache Storm [23] is a distributed real-time computation system based on the master-worker architecture and used in real-time analytics, online machine learning, continuous computation, and distributed RPC, among others. Storm users define topologies that consume streams of data (spouts) and process those streams in arbitrarily complex ways (bolts), re-partitioning the streams between each stage of the computation however needed. Although Storm natively provides at-least-once processing, it also supports exactly-once processing via its high-level API called Trident. Twitter Heron [24] is a real-time, fault-tolerant stream processing engine. Heron was built as Storm's successor, meaning that the topology concepts (spouts and bolts) are the same and its API is compatible with Storm. However, Heron provides better resource isolation, new scheduler features (such as on-demand resources), better throughput, and lower latency.
Apache Spark [25] is a general framework for big data processing that was originally designed to overcome the limitations of MapReduce [26]. Among the many built-in modules, Spark Streaming [27] is an extension of the Spark core to evolve from batch processing to continuous processing by emulating streaming via micro-batching. It ingests input data streams from many sources (e.g., Kafka, Flume, Kinesis, ZeroMQ) and divides them into batches that are then processed by the Spark engine; allowing streaming to be combined with batch queries. Internally, the continuous stream of data is represented as a sequence of RDDs in a high-level abstraction called Discretized Stream (DStream).

Notice that Spark is based on high-level operators (operators on RDDs) that are internally represented as a DAG, limiting the patterns of the applications. In contrast, our approach is based on sequential programming, which allows the developer to build any kind of application. Furthermore, micro-batching requires a predefined threshold or frequency before any processing occurs; this can be "real-time" enough for many applications, but may lead to failures when micro-batching is simply not fast enough. In contrast, our solution uses a dedicated streaming engine to handle dataflows; relying on streaming technologies rather than micro-batching and ensuring that the data is processed as soon as it is available.

On the other hand, other solutions combine existing frameworks to support Hybrid Workflows. Asterism [28] is a hybrid framework combining dispel4py and Pegasus at different levels to run data-intensive stream-based applications across platforms on heterogeneous systems. The main idea is to represent the different parts of a complex application as dispel4py workflows which are, then, orchestrated by Pegasus as tasks. While the stream-based execution is managed by dispel4py, the data movement between the different execution platforms and the workflow engine (submit host) is managed by Pegasus. Notice that Asterism can only handle dataflows inside task-based workflows (dispel4py workflows represented as Pegasus tasks), while our proposal is capable of orchestrating nested task-flows, nested dataflows, dataflows inside task-based workflows, and task-based workflows inside dataflows.
3. Background
To enable the construction of Hybrid Workflows using the same programming model, we decided to extend an already existing task-based programming model with support for dataflows. Since we have chosen COMPSs as the base workflow manager for our prototype, this section introduces the essential concepts for understanding its programming model and supporting runtime. The second part of this section briefly introduces Kafka, since the default backend of the Distributed Stream Library uses it to support object streams.
3.1. COMP Superscalar (COMPSs)

COMP Superscalar (COMPSs) is a programming model based on sequential programming and designed to abstract developers away from the parallelisation and distribution details such as thread creation, synchronisation, data distribution, message passing, or fault tolerance. COMPSs is a task-based model; application developers select a set of methods whose invocations are considered tasks that will run asynchronously in distributed nodes.

As shown in Figure 2, Java is the native programming language to develop COMPSs applications; however, COMPSs also provides bindings for Python (PyCOMPSs [29]) and C/C++ [30]. Its programming model is based on annotations that are used to choose class and object methods as tasks. These annotations can be split into two groups:
• Method Annotations: Annotations added to the sequential code methods to define them as tasks and potentially execute them in parallel.
• Parameter Annotations: Annotations added to the parameters of an annotated method to indicate the direction (IN, OUT, INOUT) of the data used by a task.
Figure 2: COMPSs overview.
The runtime [20] system supporting the model follows a master-worker architecture. The master node, on which the main code of the application runs, orchestrates the execution of the applications and its tasks on the underlying infrastructure, described in an XML configuration file. For that purpose, it intercepts calls to methods annotated as tasks and, for each detected call, it analyses the data dependencies with previous tasks according to the defined parameter annotations. As the result of this analysis, the runtime builds a Directed Acyclic Graph (DAG) where nodes represent tasks and edges represent data dependencies between them; thus, the runtime infers the application parallelism. The master node transparently schedules and submits a task execution on a worker node, handling the required data transfers. If a partial failure arises during a task execution, the master node handles it with job re-submission and re-scheduling techniques.

COMPSs guarantees the portability across different computing platforms such as clusters, supercomputing machines, clouds, or container-managed infrastructures without modifying the application code. Also, the runtime allows the usage of third-party plugins by implementing a simple interface to extend its usage to new infrastructures or change the scheduling policy.

Furthermore, the COMPSs framework also provides a live-monitoring tool through a built-in web interface. For further details on executions, users can enable the instrumentation of their application using Extrae [31] and generate post-mortem traces that can be analysed with Paraver [32].
A COMPSs application in Java is composed of three parts:

• Main application: Sequential code defining the workflow of the application. It must contain calls to class or object methods annotated as tasks so that, at execution time, they can be asynchronously executed on the available resources.

• Remote Methods: Code containing the implementation of the tasks. This code can be in the same file as the application's main code or in one or more separate files.

• Annotated Interface: List of annotated methods that can be remotely executed as tasks. It contains one entry per task defining the Method Annotation, the object or class method name, and one Parameter Annotation per method parameter.

Notice that COMPSs Java applications do not require any special API, pragma, or construct since COMPSs instruments the application's code at execution time to detect the tasks defined in the annotated interface. Hence, the COMPSs annotations do not interfere with the applications' code, and all applications can be tested sequentially. A minimal example of the three parts is sketched below.
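The following sketch illustrates the three parts with a hypothetical increment application; the class names and file layout are illustrative, while the @Method and @Parameter annotations follow the interface syntax shown later in Listings 6 and 7.

// Main application: plain sequential Java; each call to Tasks.increment()
// becomes an asynchronous task at execution time
public class Main {
    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        for (int i = 0; i < n; i++) {
            Tasks.increment("counter_" + i + ".txt");
        }
    }
}

// Remote method: the implementation of the task
public class Tasks {
    public static void increment(String fileName) {
        // Read the counter stored in fileName, add one, and write it back
    }
}

// Annotated interface: declares increment() as a task with one INOUT file
public interface MainItf {
    @Method(declaringClass = "Tasks")
    void increment(
        @Parameter(type = Type.FILE, direction = Direction.INOUT)
        String fileName
    );
}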
The Python syntax in COMPSs is supported through a binding: PyCOMPSs. This Python binding is supported by a Binding-commons layer which focuses on enabling the functionalities of the runtime in other languages (currently, Python and C/C++). It has been designed as an API with a set of defined functions. It is written in C and performs the communication with the runtime through the JNI [33].

In contrast with the Java syntax, all PyCOMPSs annotations are done inline. The Method Annotations are in the form of Python decorators. Hence, the users can add the @task decorator on top of a class or object method to indicate that its invocations will become tasks at execution time. Furthermore, the Parameter Annotations are contained inside the Method Annotation.

Listing 1 shows an example of a task annotation. The first line contains the task annotation in the form of a Python decorator, while the rest of the code is a regular Python function. The parameter c is of type INOUT, and parameters a and b are set to the default type IN. The directionality tags are used at execution time to derive the data dependencies between tasks and are applied at an object level, taking into account its references to identify when two tasks access the same object.

@task(c=INOUT)
def multiply(a, b, c):
    c += a * b

Listing 1: PyCOMPSs task annotation example.
A tiny synchronisation API completes the PyCOMPSs syntax. For instance, as shown in Listing 2, compss_wait_on waits until all the tasks modifying result's value are finished and brings the value to the node executing the main program (line 4). Once the value is retrieved, the execution of the main program code is resumed. Given that PyCOMPSs is mostly used in distributed environments, synchronising may imply a data transfer from remote storage or memory space to the node executing the main program.

for block in data:
    presult = word_count(block)
    reduce_count(result, presult)
final_result = compss_wait_on(result)

Listing 2: PyCOMPSs synchronisation API example.
Similarly, the API includes a compss_open(file_name, mode='r') to synchronise files, and a compss_barrier() to explicitly wait for the completion of all the previous tasks.
3.2. Kafka

Figure 3 illustrates the basic concepts in Kafka and how they relate to each other. Records (each blue box in the figure) are key-value pairs containing application-level information registered along with its publication time.

Figure 3: Description of Kafka's basic concepts.
Kafka users define several categories or topics to which records belong. Kafka relies on ZooKeeper [34, 35] to store each topic as a partitioned log with an arbitrary number of partitions and maintains a configurable number of partition replicas across the cluster to provide fault tolerance and record-access parallelism. Each partition contains an immutable, publication-time-ordered sequence of records, each uniquely identified by a sequential id number known as the offset of the record. The example in the figure defines two topics (Topic A and Topic B) with 2 and 3 partitions, respectively.

Finally, Producers and Consumers are third-party application components that interact with Kafka to publish and retrieve data. The former add new records to the topics of their choice, while the latter subscribe to one or more topics for receiving records related to them. Consumers can join in Consumer Groups. Kafka ensures that each record published to a topic is delivered to at least one consumer instance within each subscribing group; thus, multiple processes on remote machines can share the processing of the records of that topic. Although most often delivered exactly once, records might be duplicated when one consumer crashes without a clean shutdown and another consumer within the same group takes over its partitions.

Back to the example in the figure, Producer A publishes one record to Topic A, and Producer B publishes two records, one to Topic A and one to Topic B. Consumer A, with a group of its own, processes all the records in Topic A and Topic B. Since Consumer B and Consumer C belong to the same consumer group, they share the processing of all the records from Topic B.

Besides the Consumer and Producer APIs, Kafka also provides the Stream Processor and Connector APIs. The former, usually used in the intermediate steps of fluent stream processing, allows application components to consume an input stream from one or more topics and produce an output stream to one or more topics. The latter is used for connecting producers and consumers to already existing applications or data systems. For instance, a connector to a database might capture every change to a table.
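As a reference for how these concepts surface in code, the sketch below publishes one record to a topic and consumes it through Kafka's Java client API; the server address, topic name, group id, and (de)serialiser choices are placeholder assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer: add a new record to Topic A
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("topicA", "key", "value"));
        }

        // Consumer: subscribe to Topic A and poll the published records;
        // consumers sharing the same group id share the records of a topic
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "groupA");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("topicA"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}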
4. Architecture
Figure 4 depicts a general overview of the proposed solution. When executing regular task-based workflows, the application written following the programming model interacts with the runtime to spawn the remote execution of tasks and retrieve the desired results. Our proposal includes a representation of a stream (DistroStream interface) that provides applications with homogeneous stream accesses regardless of the stream backend supporting them. Moreover, we extend the programming model and runtime to provide task annotations and scheduling capabilities for streams.
Figure 4: General architecture.
The following subsections discuss the architecture of the proposed solution in a bottom-up approach, starting from the representation of a stream (DistroStream API) and its implementations. Next, we describe the Distributed Stream Library and its internal components. Finally, we detail the integration of this library with the programming model (COMPSs) and the necessary extensions of its runtime system.
The Distributed Stream is a representation of a stream used by applications to publish and receive data values. Its interface provides a common API to guarantee homogeneity on all interactions with streams.

As shown in Listing 3, the DistroStream interface provides a publish method for submitting a single message or a list of messages (lines 5 and 6) and a poll method to retrieve all the currently available unread messages (lines 9 and 10). Notice that the latter has an optional timeout parameter (in milliseconds) to wait until an element becomes available or the specified time expires. Moreover, the streams can be created with an optional alias parameter (line 2) to allow different applications to communicate through them. Also, the interface provides common methods to close the stream, check its status, and retrieve metadata information.

// INSTANTIATION
public DistroStream(String alias) throws RegistrationException;

// PUBLISH METHODS
public abstract void publish(T message) throws BackendException;
public abstract void publish(List<T> messages) throws BackendException;

// POLL METHODS (signatures inferred from the description above)
public abstract List<T> poll() throws BackendException;
public abstract List<T> poll(long timeout) throws BackendException;

Listing 3: DistroStream interface (excerpt).
Figure 5: DistroStream class relationship.
As shown in Figure 5, two different implementations of the DistroStream API provide the specific logic to support object and file streams. Object streams are suitable when sharing data within the same language or framework. On the other hand, file streams allow different frameworks and languages to share data. For instance, the files generated by an MPI simulation in C or Fortran can be received through a stream and processed in a Python or Java application.
ObjectDistroStream (ODS) implements the generic DistroStream interface to support object streams. Each ODS has an associated ODSPublisher and ODSConsumer that interact appropriately with the software handling the message transmission (streaming backend). The ODS instantiates them upon the first invocation of a publish or a poll method, respectively. This behaviour guarantees that the same object stream has different publisher and consumer instances when accessed from different processes, and that the producer and consumer instances are only registered when required, avoiding unneeded registrations on the streaming backend.

At its current state, the available implementation is backed by Kafka, but the design is prepared to support many backends. Notice that the ODS, the ODSPublisher, and the ODSConsumer are just abstractions to hide the interaction with the underlying backend. Hence, any other backend (such as an MQTT broker) can be supported without any modification at the workflow level by implementing the functionalities defined in these abstractions.

Considering the Kafka concepts introduced in Section 3.2, each ODS becomes a Kafka topic named after the stream id. When created, the ODSPublisher instantiates a KafkaProducer whose publish method builds a new ProducerRecord and submits it to the corresponding topic via the KafkaProducer.send method. If the publish invocation sends several messages, the ODSPublisher iteratively performs the publishing process for each message so that Kafka registers them as separate records.

Likewise, a new KafkaConsumer is instantiated along with an ODSConsumer. Then, the KafkaConsumer is registered to a consumer group shared by all the consumers of the same application to avoid replicated messages, and subscribed to the topic named after the id of the stream. Hence, the poll method retrieves a list of ConsumerRecords and deserialises their values. To ensure that records are processed exactly once, consumers also interact with Kafka's AdminClient to delete all the processed records from the database.
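The sketch below illustrates the producer/consumer pattern of this API, assuming the signatures of Listing 3 plus the close() and isClosed() status calls described in the text; the process() helper is hypothetical.

// PRODUCER
void produce(List<T> objs) throws Exception {
    // Create the stream; the type parameter T fixes the type of all elements
    ObjectDistroStream<T> ods = new ObjectDistroStream<>("myStream");
    // Publish a single element
    ods.publish(objs.get(0));
    // Publish a list of elements
    ods.publish(objs);
    // Close the stream once all the data has been published
    // (metadata accessors are also available but omitted here)
    ods.close();
}

// CONSUMER
void consume(ObjectDistroStream<T> ods) throws Exception {
    // Poll the unread elements while the stream remains open
    while (!ods.isClosed()) {
        List<T> newObjs = ods.poll();
        process(newObjs);
    }
    // Final poll using the optional timeout (in milliseconds)
    process(ods.poll(1000));
}

Listing 4: Object streams example in Java (sketch).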
Listing 4 shows an example using object streams in Java. Notice that the stream creation forces all the stream objects to be of the same type T. Internally, the stream serialises and deserialises the objects so that the application can publish and poll elements of type T directly to/from the stream. As previously explained, the example also shows the usage of the publish method for a single element or a list of elements, the poll method with and without the optional timeout parameter, and the common API calls to close the stream and check its status. Due to space constraints, the example only shows the ODS usage in Java, but our prototype provides an equivalent implementation in Python.
The FileDistroStream implementation (FDS) implements the generic DistroStream interface to support the streaming of files. Like ODS, its design allows using different backends; however, at its current state, it uses a custom implementation that monitors the creation of files inside a given directory. The Directory Monitor backend sends the file locations through the stream and relies on a distributed file system to share the file content. Thus, the monitored directory must be available to every client on the same path.
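The sketch below illustrates this behaviour, assuming a constructor that receives the monitored base directory and the same close()/isClosed()/poll() calls used for object streams; the process() helper is hypothetical.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// PRODUCER
void produce(String baseDir, List<String> contents) throws Exception {
    // Create the stream: baseDir is the directory monitored for new files
    FileDistroStream fds = new FileDistroStream(baseDir);
    // Files are not explicitly published: writing them into the monitored
    // directory automatically publishes their locations
    int i = 0;
    for (String content : contents) {
        Path file = Paths.get(baseDir, "file" + i++);
        Files.write(file, content.getBytes());
    }
    // Close the stream once all the files have been written
    fds.close();
}

// CONSUMER
void consume(FileDistroStream fds) throws Exception {
    // Poll the paths of the newly available files while the stream is open
    while (!fds.isClosed()) {
        List<String> newFiles = fds.poll();
        for (String filePath : newFiles) {
            process(filePath);  // hypothetical application-specific processing
        }
    }
}

Listing 5: File streams example in Java (sketch).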
Listing 5 shows an example using file streams in Java. Notice that the FDS instantiation requires a base directory to monitor the creation of files, and that it optionally accepts an alias argument to retrieve the content of an already existing stream. Also, files are not explicitly published on the stream since the base directory is automatically monitored; instead, regular methods to write files are used. However, the consumer must explicitly call the poll method to retrieve the list of newly available file paths in the stream. As with ODS, applications can also use the common API calls to close the stream, check its status, and retrieve metadata information. Due to space constraints, the example only shows the FDS usage in Java, but our prototype provides an equivalent implementation in Python.
The Distributed Stream Library (DistroStreamLib) handles the stream objects and provides three major components. First, the DistroStream API and implementations described in the previous sections.

Second, the library provides the DistroStream Client that must be available for each application process. The client is used to forward any stream metadata request to the DistroStream Server and any stream data access to the suitable stream backend (i.e., Directory Monitor or Kafka). To avoid repeated queries to the server, the client stores the retrieved metadata in a cache-like fashion. Either the Server or the backend can invalidate the cached values.

Third, the library provides the DistroStream Server process that is unique for all the applications sharing the stream set. The server maintains a registry of active streams, consumers, and producers with the purpose of coordinating any stream data or metadata access. Among other responsibilities, it is in charge of assigning unique ids to new streams, checking the access permissions of producers and consumers when requesting publish and poll operations, and notifying all registered consumers when the stream has been completely closed and there are no producers remaining.

Figure 6 contains a sequence diagram that illustrates the interaction of the different Distributed Stream Library components when serving a user petition. The DistroStream implementation used by the applications always forwards the requests to the DistroStream Client available on the process. The client communicates with the DistroStream Server for control purposes, and retrieves the real data from the backend.

Figure 6: Sequence diagram of the Distributed Stream Library components.
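The following pseudo-Java sketch summarises this flow; every class and method name here is illustrative rather than the library's actual API.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative client: metadata requests go to the DistroStream Server
// (with caching), while data accesses go to the stream backend
class DistroStreamClientSketch {

    interface ServerConnection {
        StreamMetadata requestMetadata(String streamId);
        void checkPollPermission(String streamId);
    }

    interface StreamBackend {
        List<byte[]> fetchNewElements(String streamId);
    }

    static class StreamMetadata { /* id, alias, stream type, ... */ }

    private final ServerConnection server;  // socket to the DistroStream Server
    private final StreamBackend backend;    // e.g., Kafka or Directory Monitor
    private final Map<String, StreamMetadata> cache = new HashMap<>();

    DistroStreamClientSketch(ServerConnection server, StreamBackend backend) {
        this.server = server;
        this.backend = backend;
    }

    StreamMetadata getMetadata(String streamId) {
        // Serve from the local cache when possible; the server or the
        // backend may invalidate cached entries at any time
        return cache.computeIfAbsent(streamId, server::requestMetadata);
    }

    List<byte[]> poll(String streamId) {
        // Control goes through the server (e.g., access permissions)...
        server.checkPollPermission(streamId);
        // ...while the real data is retrieved from the backend
        return backend.fetchNewElements(streamId);
    }
}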
As already mentioned in Section 3.1, the prototype to evaluate Hybrid Workflows is based on the COMPSs workflow manager. At the programming-model level, we have extended the COMPSs Parameter Annotation to include a new STREAM type.

As shown in Listing 6, on the one hand, the users declare producer tasks (methods that write data into a stream) by adding a parameter of type STREAM and direction OUT (lines 3 to 7 in the listing). On the other hand, the users declare consumer tasks (methods that read data from a stream) by adding a parameter of type STREAM and direction IN (lines 9 to 13 in the listing). In the current design, we have not considered INOUT streams because we do not envision a use case where the same method writes data into its own stream. However, the design can be easily extended to support such behaviour when required.

public interface Itf {

    @Method(declaringClass = "Producer")
    Integer sendMessages(
        @Parameter(type = Type.STREAM, direction = Direction.OUT)
        DistroStream stream
    );

    @Method(declaringClass = "Consumer")
    Result receiveMessages(
        @Parameter(type = Type.STREAM, direction = Direction.IN)
        DistroStream stream
    );
}

Listing 6: Stream parameter annotation example in Java.
Furthermore, we want to highlight that this new annotation integrates smoothly with any other previous annotation. For instance, Listing 7 shows a single producer task that uses two parameters: a stream parameter typical of dataflows (lines 5 and 6) and a file parameter typical of task-based workflows (lines 7 and 8).

public interface Itf {

    @Method(declaringClass = "Producer")
    Integer sendMessages(
        @Parameter(type = Type.STREAM, direction = Direction.OUT)
        DistroStream stream,
        @Parameter(type = Type.FILE, direction = Direction.IN)
        String file
    );
}

Listing 7: Example combining stream and file parameters in Java.
As depicted in Figure 7, COMPSs registers the different tasks from the application's main code through the Task Analyser component. Then, it builds a task graph based on the data dependencies and submits it to the Task Dispatcher. The Task Dispatcher interacts with the Task Scheduler to schedule the dependency-free tasks when possible and, eventually, submit them to execution. The execution step includes the job creation, the transfer of the input data, the job transfer to the selected resource, the real task execution on the worker, and the output retrieval from the worker back to the master. If any of these steps fail, COMPSs provides fault-tolerance mechanisms for partial failures. Also, once the task has finished, COMPSs stores the monitoring data of the task, synchronises any data required by the application, releases the data-dependent tasks so that they can be scheduled, and deletes the task.

Figure 7: Structure of the internal COMPSs components.

Therefore, the new STREAM annotation has forced modifications in the Task Analyser and Task Scheduler components. More specifically, notice that a stream parameter does not define a traditional data dependency between a producer and a consumer task since both tasks can run at the same time. However, some information must be stored so that the Task Scheduler can correctly handle the available resources and the data locality. In this sense, when using the same stream object, our prototype prioritises producer tasks over consumer tasks to avoid wasting resources when a consumer task is waiting for data to be produced by a non-running producer task. Moreover, the Task Scheduler assumes that the resources that are running (or have run) producer tasks are the data locations for the stream. This information is used to schedule the consumer tasks accordingly and minimise as much as possible the data transfers between nodes.
Figure 8: COMPSs and Distributed Stream Library deployment.

Regarding the components' deployment, as shown in Figure 8, the COMPSs master spawns the DistroStream Server and the required backend. Furthermore, it includes a DistroStream Client to handle the stream accesses and requests performed on the application's main code. On the other hand, the COMPSs workers only spawn a DistroStream Client to handle the stream accesses and requests performed on the tasks. Notice that the COMPSs master-worker communication is done through NIO [36], while the DistroStream Server-Client communication is done through sockets.
5. Use Cases
Enabling hybrid task-based workflows and dataflows in a single programming model allows users to define new types of complex workflows. We introduce four patterns that appear in real-world applications so that users identifying these patterns in their workflows can benefit from the new capabilities and performance improvements of Hybrid Workflows. The next subsections provide an in-depth analysis of each use case.
5.1. Processing data continuously as it is generated

One of the main drawbacks of task-based workflows is waiting for task completion to process its results. Often, the output data is generated continuously during the task execution rather than at the end. Hence, enabling data streams allows users to process the data as it is generated.

@constraint(computing_units=CORES_SIMULATION)
@task(varargs_type=FILE_OUT)
def simulation(num_files, *args):
    ...

@constraint(computing_units=CORES_PROCESS)
@task(input_file=FILE_IN, output_image=FILE_OUT)
def process_sim_file(input_file, output_image):
    ...

@constraint(computing_units=CORES_MERGE)
@task(output_gif=FILE_OUT, varargs_type=FILE_IN)
def merge_reduce(output_gif, *args):
    ...


def main():
    num_sims, num_files, sim_files, output_images, output_gifs = ...

    for i in range(num_sims):
        simulation(num_files, *sim_files[i])

    for i in range(num_sims):
        for j in range(num_files):
            process_sim_file(sim_files[i][j], output_images[i][j])

    for i in range(num_sims):
        merge_reduce(output_gifs[i], *output_images[i])

    for i in range(num_sims):
        output_gifs[i] = compss_wait_on_file(output_gifs[i])

Listing 8: Simulations' application in Python without streams.
For instance, Listing 8 shows the code of a pure task-based application that launches num_sims simulations (line 21). Each simulation produces output files at different time steps (i.e., an output file every iteration of the simulation). The results of these simulations are processed separately by the process_sim_file task (line 25) and merged into a single GIF per simulation (line 28). The example code also includes the task definitions (lines 1 to 13) and the synchronisation API calls to retrieve the results (line 31).

Figure 9 shows the task graph generated by the previous code when running with 2 simulations (num_sims) and 5 files per simulation (num_files). The simulation tasks are shown in blue, the process_sim_file in white and red, and the merge_reduce in pink. Notice that the simulations and the processing of the files cannot run in parallel since the task-based workflow forces the completion of the simulation tasks before beginning any of the processing tasks.
Figure 9: Task graph of the simulation application without streaming.
On the other hand, Listing 9 shows the code of the same application using streams to retrieve the data from the simulations as it is generated and forward it to its processing tasks. The application initialises the streams (lines 20 to 22), launches num_sims simulations (line 25), spawns a process task for each received element in each stream (line 34), merges all the output files into a single GIF per simulation (line 37), and synchronises the final results. The process_sim_file and merge_reduce task definitions are identical to the previous example. Conversely, the simulation task definition uses the STREAM_OUT annotation to indicate that one of the parameters is a stream where the task is going to publish data. Also, although the simulation, merge, and synchronisation phases are very similar to the pure task-based workflow, the processing phase is completely different (lines 27 to 34). When using streams, the main code needs to check the stream status, retrieve its published elements, and spawn a process_sim_file task per element. However, the complexity of the code does not increase significantly when adding streams to an existing application.

@constraint(computing_units=CORES_SIMULATION)
@task(fds=STREAM_OUT)
def simulation(fds, num_files):
    ...

@constraint(computing_units=CORES_PROCESS)
@task(input_file=FILE_IN, output_image=FILE_OUT)
def process_sim_file(input_file, output_image):
    ...

@constraint(computing_units=CORES_MERGE)
@task(output_gif=FILE_OUT, varargs_type=FILE_IN)
def merge_reduce(output_gif, *args):
    ...


def main():
    num_sims, num_files, output_images, output_gifs = ...

    input_streams = [None for _ in range(num_sims)]
    for i in range(num_sims):
        input_streams[i] = FileDistroStream(base_dir=stream_dir)

    for i in range(num_sims):
        simulation(input_streams[i], num_files)

    for i in range(num_sims):
        while not input_streams[i].is_closed():
            new_files = input_streams[i].poll()

            for input_file in new_files:
                output_image = input_file + ".out"
                output_images[i].append(output_image)
                process_sim_file(input_file, output_image)

    for i in range(num_sims):
        merge_reduce(output_gifs[i], *output_images[i])

    for i in range(num_sims):
        output_gifs[i] = compss_wait_on_file(output_gifs[i])

Listing 9: Simulations' application in Python with streams.

Figure 10 shows the task graph generated by the previous code when running with the same parameters as the pure task-based example (2 simulations and 5 files per simulation). The colour code is also the same as in the previous example: the simulation tasks are shown in blue, the process_sim_file in white and red, and the merge_reduce in pink. Notice that streams enable the execution of the processing tasks while the simulations are still running; potentially reducing the total execution time and increasing the resources utilisation (see Section 6.2 for further details).

Figure 10: Task graph of the simulation application with streaming.
5.2. Removing synchronisation points

Streams can also be used to communicate data between tasks without waiting for the tasks' completion. This technique can be useful when performing parameter sweep, cross-validation, or running the same algorithm with different initial points.
Figure 11: Task graph of the multi-simulations application.
For instance, Figure 11 shows three algorithms running simultaneously that exchange control data at the end of every iteration. Notice that the data exchange at the end of each iteration can be done synchronously by stopping all the simulations, or asynchronously by sending the updated results and processing the pending messages in the stream (even though some messages of the current iteration might be received in the next iteration). Furthermore, each algorithm can run a complete task-based workflow to perform the iteration calculus, obtaining a nested task-based workflow inside a pure dataflow.
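A sketch of the asynchronous variant of this exchange, using the DistroStream API from Section 4, might look as follows; the State type, its merge logic, and the computeIteration() step are placeholders.

import java.util.List;

// Each algorithm runs as a single task, publishing its state at the end
// of every iteration and merging whatever states have already arrived
// from the others (late messages are simply picked up next iteration)
class IterativeComputation {

    static class State {
        static State initial() { return new State(); }
        State merge(List<State> others) { return this; }
    }

    void run(ObjectDistroStream<State> exchange, int iterations) throws Exception {
        State state = State.initial();
        for (int it = 0; it < iterations; it++) {
            state = computeIteration(state);
            // Asynchronous exchange: publish our state...
            exchange.publish(state);
            // ...and merge the currently available ones without blocking
            // (a real implementation would tag and skip its own messages)
            state = state.merge(exchange.poll());
        }
    }

    State computeIteration(State s) { /* iteration calculus */ return s; }
}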
5.3. Processing external real-time data

Many applications receive their data continuously from external streams (e.g., IoT sensors) that are not part of the application itself. Moreover, depending on the workload, the stream data can be produced by a single task and consumed by many tasks (one to many), produced by many tasks and consumed by a single task (many to one), or produced by many tasks and consumed by many tasks (many to many). The Distributed Stream Library supports all three scenarios transparently, and allows configuring the consumer mode to process the data at least once, at most once, or exactly once when using many consumers.
Figure 12: Task graph of the sensor application.
Figure 12 shows an external sensor (Stream 1 in the figure) producing data that is filtered simultaneously by 4 tasks (coloured in white). The relevant data is then extracted from an internal stream (Stream 2) by an intermediate task (task 6, coloured in red) and used to run a task-based algorithm. The result is a hybrid task-based workflow and dataflow. Also, the sensor uses a one-to-many stream configured to process the data exactly once, and the filter tasks (coloured in white) use a many-to-one stream to publish data to the extract task (coloured in red).
5.4. Combining task-based workflows and dataflows at different levels

Our proposal also allows combining task-based workflows and dataflows at different levels; having nested task-based workflows inside a dataflow task or vice-versa. This feature enables the internal parallelisation of tasks, allowing workflows to scale resources up and down depending on the workload.
Figure 13: Task graph of the hybrid nested application.
For instance, Figure 13 shows a dataflow with two nested task-based workflows. The application is similar to the previous use case: task 1 (coloured in pink) produces the data, task 2 (in white) filters it, task 3 (in blue) extracts and collects the data, and task 4 (in red) runs a big computation.

Notice that, in the previous use case, the application always has 4 filter tasks. However, in this scenario, the filter task has a nested task-based workflow that accumulates the received data into batches and spawns a new filter task per batch. This technique dynamically adapts the resource usage to the amount of data received by the input stream. Likewise, the big computation task also contains a nested task-based workflow. This shows that users can internally parallelise some computations without modifying the original dataflow.
6. Evaluation
This section evaluates the performance of the new features enabled by our prototype when using data streams against their equivalent implementations using task-based workflows. Furthermore, we analyse the scalability and load balancing of the stream writer and reader processes. Finally, we provide an in-depth analysis of the COMPSs runtime performance by comparing the task analysis, task scheduling, and task execution times when using pure task-based workflows or streams.
6.1. Experimental setup

The results presented in this section have been obtained using the MareNostrum 4 supercomputer [37] located at the Barcelona Supercomputing Center (BSC). Its current peak performance is 11.15 Petaflops. The supercomputer is composed of 3456 nodes, each of them with two Intel® Xeon Platinum 8160 (24 cores at 2.1 GHz each). It has 384.75 TB of main memory, 100 Gb Intel® Omni-Path Full-Fat Tree interconnection, and 14 PB of shared disk storage managed by the Global Parallel File System.

Regarding the software, we have used the DistroStream Library (available at [38]), COMPSs version 2.5.rc1909 (available at [39]), and Kafka version 2.3.0 (available at [40]). We have also used Java OpenJDK 8u131, Python 2.7.13, GCC 7.2.0, and Boost 1.64.0.

6.2. Gain of processing data continuously
As explained in the first use case in Section 5.1, one of the significant advantages of using data streams is processing data continuously as it is generated. For that purpose, Figure 14 compares the Paraver [32] traces of the original COMPSs execution (pure task-based workflow) and the execution using Hybrid Workflows. Each trace shows the available threads on the vertical axis and the execution time on the horizontal axis (36 s in both traces). Also, each colour represents the execution of a task type, corresponding to the colours shown in the task graphs of the first use case (see Section 5.1). The green flags indicate when a simulation has generated all its output files and has closed its associated writing stream. Both implementations are written in Python, and the Directory Monitor is set as the stream backend.

Figure 14: Paraver traces to illustrate the gain of processing data continuously: (a) original COMPSs execution; (b) execution with Hybrid Workflows.
In contrast to the original COMPSs execution (top), the execution using Hybrid Workflows (bottom) runs the processing tasks (white and red) while the simulations (blue) are still running; significantly reducing the total execution time and increasing the resources utilisation. Moreover, the merge_reduce tasks (pink) are able to begin their execution even before the simulation tasks have finished, since all the streams have been closed and the process_sim_file tasks have already finished.

In general terms, the gain of the implementation using Hybrid Workflows with respect to the original COMPSs implementation (calculated following Equation 1) is proportional to the number of tasks that can be executed in parallel while the simulation is active. Therefore, we perform an in-depth analysis of the trade-off between the generation and process times. It is worth mentioning that we define the generation time as the time elapsed between the generation of two elements of the simulation. Hence, the total duration of the simulation is the generation time multiplied by the number of generated elements. Also, the process time is defined as the time to process a single element (that is, the duration of the process_sim_file task).
\[ Gain = \frac{ExecutionTime_{original} - ExecutionTime_{hybrid}}{ExecutionTime_{original}} \tag{1} \]

The experiment uses 2 nodes of 48 cores each. Since the COMPSs master reserves 12 cores, there are two available workers with 36 and 48 cores, respectively. The simulation is configured to use 48 cores, leaving 36 available cores while it is active and 84 available cores when it is over. Also, the process tasks are configured to use one single core.
Figure 15: Average execution time and gain of a simulation with increasing generation time.
Figure 15 depicts the average execution time of 5 runs where each simulation generates 500 elements. The process time is fixed to 60,000 ms, while the generation time between stream elements varies from 100 ms to 2,000 ms. For short generation times, almost all the processing tasks are executed when the generation task has already finished, obtaining no gain with respect to the implementation with objects. For instance, when generating elements every 100 ms, the simulation takes 50,000 ms in total (500 elements · 100 ms/element). Since the process tasks last 60,000 ms, none of them will have finished before the simulation ends; leading to almost no gain.

When increasing the generation time, more and more tasks can be executed while the generation is still active; achieving a 19% gain when generating stream elements every 500 ms. However, the gain is limited because the last generated elements are always processed when the simulation is over. Therefore, increasing the generation time from 500 ms to 2,000 ms only raises the gain from 19% to 23%.

Figure 16: Average execution time and gain of a simulation with increasing process time.
On the other hand, Figure 16 illustrates the average execution time of 5 runs that generate 500 process tasks with a fixed generation time of 100 ms and a process time varying from 5,000 ms up to 60,000 ms. Notice that the total simulation time is 50,000 ms. When the processing time is short, many tasks can be executed while the generation is still active; achieving a maximum 23% gain when the processing time is 5,000 ms.

As in the previous case, when the processing time increases, the number of tasks that can be executed while the generation is active decreases and, thus, so does the gain. Also, the gain is almost zero when the processing time is big enough (60,000 ms) that none of the process tasks will have finished before the generation ends.
6.3. Gain of removing synchronisation points

Many workflows are composed of several iterative computations running simultaneously until a certain convergence criterion is met. As described in Section 5.2, this technique is useful when performing parameter sweep, cross-validation, or running the same algorithm with different initial points.
Figure 17: Parallel iterative computations. Pure task-based workflow and Hybrid Workflow shown at left and right, respectively.
To this end, each computation requires a phase at the end of each iteration to exchange information with the rest. When using pure task-based workflows, this phase requires stopping all the computations at the end of each iteration, retrieving all the states, creating a task to exchange and update all the states, transferring back all the new states, and resuming all the computations for the next iteration. The left task graph of Figure 17 shows an example of such workflows with two iterations of two parallel computations. The first two red tasks initialise the state of each computation, the pink tasks perform the computation of each iteration, and the blue tasks retrieve and update the state of each computation.

Conversely, when using Hybrid Workflows, each computation can exchange the information at the end of each iteration asynchronously by writing and reading the states to/from streams. This technique avoids splitting each computation into tasks, stopping and resuming each computation at every iteration, and synchronising all the computations to exchange data. The right task graph of Figure 17 depicts the equivalent Hybrid Workflow of the previous example. Each computation is run in a single task (white) that performs the state initialisation, all the iterations, and all the update phases at the end of each iteration.
Figure 18: Average execution time and gain of a simulation with an increasing number of iterations.
Using the previous examples, Figure 18 evaluates the performance gain of avoiding the synchronisation and exchange process at the end of each iteration (calculated following Equation 2). The benchmark executes the pure task-based workflow (blue) and the Hybrid Workflow (green) versions of the same workflow written in Java and using Kafka as streaming backend. Also, it is composed of two independent computations with a fixed computation per iteration (2,000 ms) and an increasing number of iterations. The results shown are the mean execution times of 5 runs of each configuration.
\[ Gain = \frac{ExecutionTime_{pure\ task\text{-}based} - ExecutionTime_{hybrid}}{ExecutionTime_{pure\ task\text{-}based}} \tag{2} \]

Notice that the total gain is influenced by three factors: the removal of the synchronisation task at the end of each iteration; the cost of transferring the state between the process and the synchronisation tasks; and the division of the state's initialisation and process. Although we have reduced the state of each computation to 24 bytes and used a single worker machine to minimise the impact of the transfer overhead, the second and third factors become important when running a small number of iterations (below 32), reaching a maximum gain of 42% when running a single iteration. For a larger number of iterations (over 32), the removal of the synchronisation becomes the main factor, and the total gain reaches a steady state around 33%.

6.4. Scalability of writers and readers

Our prototype supports N-M streams, meaning that any stream can have an arbitrary number of writers and readers. To evaluate the performance and load balance, we have implemented a Java application that uses a single stream and creates N writer tasks and M reader tasks. Although our writer and reader tasks use a single core, we spawn each of them in separate nodes so that the data must be transferred. In more sophisticated use cases, each task could benefit from an intra-node technology (such as OpenMP) to parallelise the processing of the stream data.
Figure 19: Average execution time and efficiency (left and right, respectively) with an increasing number of readers and different numbers of writers.
Figure 19 depicts the average execution time (left) and the efficiency (right) of 5 runs with an increasing number of readers. Each series uses a different number of writers, also going from 1 to 8. The writers publish 100 elements in total, the size of the published objects is 24 bytes, and the time to process an element is set to 1,000 ms. The efficiency is calculated using the ideal execution time as reference; i.e., the number of elements multiplied by the time to process an element and divided by the number of readers.

Since the execution time is mainly due to the processing of the elements in the reader tasks, increasing the number of writers makes no significant difference. However, in all cases, increasing the number of readers significantly impacts the execution time, achieving a 4.84 speed-up with 8 readers. Furthermore, the efficiencies using 1 reader are close to the ideal (87% on average) because the only overheads are the creation of the elements, the task spawning, and the data transfers. However, when increasing the number of readers, the load imbalance significantly affects the efficiency, which drops to around 50% with 8 readers.
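For clarity, the ideal execution time and the efficiency used in Figure 19 can be written as:

\[
T_{ideal} = \frac{N_{elements} \times t_{process}}{N_{readers}},
\qquad
Efficiency = \frac{T_{ideal}}{T_{measured}}
\]

For instance, with 100 elements, 1,000 ms of processing per element, and 8 readers, $T_{ideal} = (100 \times 1{,}000\,ms) / 8 = 12.5\,s$.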
Figure 20: Number of stream elements processed per reader.
It is worth mentioning that the achieved speed-up is lower than the ideal (8) due to load imbalance: the elements processed by each reader task are not balanced, since elements are assigned to the first process that requests them. Figure 20 illustrates an in-depth study of the load imbalance when running 1, 2, 4, or 8 readers. Notice that, when running with 2 readers, the first reader gets almost 75% of the elements while the second one only processes 25% of the total load. The same pattern appears when increasing the number of readers, where half of the tasks perform around 70% of the total load. For instance, when running with 4 readers, 2 tasks perform 69% of the work (34.5% each), while the rest only perform 31% of the total load (15.5% each). Similarly, when running with 8 readers, 4 tasks perform 70% of the total load (17.5% each), while the other four only process 30% (7.5% each).

At its current state, the Distributed Stream Library does not implement any load-balancing technique, nor does it limit the number of elements retrieved by each poll call. As future work, since the library already stores the processes registered to each stream, it could implement some policy and send only a subset of the available elements to the requesting process rather than all of them.
To provide a more in-depth analysis of the performance of our prototype, we have compared each step of the task life-cycle when using ObjectParameter (OP from now on) or StreamParameter (SP from now on), the latter backed by ObjectDistroStreams. The following figures evaluate the task analysis, task scheduling, and task execution average times of 100 tasks using (a) a single object of increasing size (from 1 MB to 128 MB) or (b) an increasing number of objects (from 1 to 16) of fixed size (8 MB). Both implementations are written in Java, and Kafka is used as the stream backend. Regarding the task definition, notice that the OP implementation requires an ObjectParameter for each object sent to the task. In contrast, the SP implementation only requires a single StreamParameter, since all the objects are sent through the stream itself.
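As a reference for how the two task definitions differ, the following COMPSs-style Java interface sketches them side by side. The annotation packages and the Type.STREAM constant follow COMPSs conventions but are assumptions of this sketch, as is the hypothetical Payload type.

    // Hypothetical COMPSs task interface contrasting the two definitions.
    // Imports assumed from the COMPSs annotation packages.
    import es.bsc.compss.types.annotations.Parameter;
    import es.bsc.compss.types.annotations.parameter.Direction;
    import es.bsc.compss.types.annotations.parameter.Type;
    import es.bsc.compss.types.annotations.task.Method;

    public interface BenchmarkItf {

        // OP version: one ObjectParameter per object sent to the task.
        @Method(declaringClass = "Benchmark")
        void processObjects(
            @Parameter(type = Type.OBJECT, direction = Direction.IN) Payload o1,
            @Parameter(type = Type.OBJECT, direction = Direction.IN) Payload o2
            // ...one annotated parameter per additional object
        );

        // SP version: a single StreamParameter; the objects themselves
        // travel through the stream, not through the task signature.
        @Method(declaringClass = "Benchmark")
        void processStream(
            @Parameter(type = Type.STREAM, direction = Direction.IN)
            ObjectDistroStream<Payload> input
        );
    }

Note that adding more objects to the OP version grows the task signature, whereas the SP version keeps a single parameter regardless of how many objects are published.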
Figure 21: Task analysis average time for one single parameter with increasing sizes (left) or an increasing number of parameters (right).
Figure 21 compares the task analysis results. The task analysis time is the time spent by the runtime to register the task and its parameters into the system. It is worth mentioning that increasing the object's size does not affect the analysis time in either the OP or the SP implementation. There is, however, a difference of around 0.05 ms due to the creation of the internal structures that represent object parameters or stream parameters.

On the other hand, increasing the number of objects directly affects the task analysis time because the runtime needs to register each task parameter individually. For the OP implementation, each object maps to an extra task parameter, and thus the task analysis time slightly increases with the number of objects. Conversely, for the SP implementation, the stream parameter itself is not modified, since we only increase the number of published objects. Hence, the task analysis time remains constant when increasing the number of objects.
Figure 22: Task scheduling average time for one single parameter with increasing sizes (left) or an increasing number of parameters (right).
Figure 22 compares the task scheduling results. On the one hand, the scheduling time for both implementations varies from 2.05 ms to 2.20 ms but does not show any clear tendency to increase with the object's size. On the other hand, when increasing the number of objects, the scheduling time increases for the OP implementation and remains constant for the SP implementation. This behaviour is due to the fact that the default COMPSs scheduler implements data locality and, thus, the scheduling time is proportional to the number of parameters. As in the previous case, increasing the number of objects increases the number of task parameters for the OP implementation (increasing its scheduling time) but keeps a single parameter for the SP implementation (maintaining its scheduling time).

Figure 23: Task execution average time for one single parameter with increasing sizes (left) or an increasing number of parameters (right).
Figure 23 compares the task execution results. The task execution time covers the transfer of all the task parameters and the task execution itself. Regarding the SP implementation, the time remains constant at around 208 ms regardless of the object's size and the number of objects, because the measurement only considers the transfer of the stream object itself and the execution time of the poll method. It is worth mentioning that the actual transfers of the objects are done by Kafka when invoking the publish method in the main code, and thus they are executed in parallel while COMPSs spawns the task on the worker machine.

Conversely, the execution time for the OP implementation increases with both the object's size and the number of objects, since the serialisation and transfer times also increase. However, the task execution does not need to fetch the objects (the poll method), since all of them have already been transferred. This trade-off can be observed in the figure, where the OP implementation performs better than the SP implementation when using task parameters smaller than 48 MB and performs worse for bigger cases. Notice that only the total objects' size is relevant, since the same behaviour is shown when using a single 48 MB object or 6 objects of 8 MB each.

Since the real object transfers when using SP are executed during the publish method and cannot be observed by measuring the task execution time, we have also measured the total execution time of the benchmark for both implementations. Figure 24 shows the total execution time with an increasing number of objects of 8 MB. In contrast to the previous plot, both implementations show an execution time that grows proportionally to the objects' total size. Also, the SP implementation only outperforms the OP implementation when using more than 12 objects.
Figure 24: Total execution time with increasing number of parameters.
To conclude, since there are no major differences regarding the task analysis time nor the task scheduling time, we can safely assume that the use of streams instead of regular objects is recommended when the total size of the task parameters exceeds 48 MB and there are more than 12 objects published to the stream.
7. Conclusion and Future Work
This paper demonstrates that task-based workflows and dataflows can be integrated into a single programming model to better cover the needs of the new Data Science workflows. Using Hybrid Workflows, developers can build complex pipelines with different approaches at many levels using a single framework.

The proposed solution relies on the DistroStream concept: a generic API used by applications to handle stream accesses homogeneously regardless of the software backing it. Two implementations provide the specific logic to support object and file streams. The first one, ObjectDistroStream, is built on top of Kafka to enable object streams. The second one, FileDistroStream, monitors the creation of files inside a directory, sends the file locations through the stream, and relies on a distributed file system to share the file content.

The DistroStream API and both implementations are part of the DistroStreamLib, which also provides the DistroStream Client and the DistroStream Server. While the client acts as a broker on behalf of the application and interacts with the corresponding backend, the server manages the streams' metadata and coordinates the accesses.

By integrating the DistroStreamLib into a task-based Workflow Manager, its programming model can easily support Hybrid Workflows. The described prototype extends COMPSs to enable tasks with continuous input and output data by providing a new annotation for stream parameters. Implementing the handling of such stream-type values led to some modifications of the Task Analyser and Task Scheduler components of the runtime. Using the DistroStreamLib also implied changes at deployment time, since its components need to be spawned along with COMPSs. On the one hand, the COMPSs master hosts the DistroStream Server, the required stream backend, and a DistroStream Client to handle stream accesses on the application's main code. On the other hand, each COMPSs worker contains a DistroStream Client that performs the stream accesses on tasks. Although the described prototype only builds on COMPSs, it can be used as an implementation reference for any other existing task-based framework.

This paper also presents four use cases illustrating the new capabilities that users may identify in their workflows to benefit from the use of Hybrid Workflows. On the one hand, streams can be internal or external to the application and can be used to communicate continuous data or control data. On the other hand, streams can be accessed inside the main code, native tasks (i.e., Java or Python), or non-native tasks (i.e., MPI, binaries, and nested COMPSs workflows). Furthermore, the Distributed Stream Library supports the one-to-many, many-to-one, and many-to-many scenarios transparently, and allows configuring the consumer mode to process the data at least once, at most once, or exactly once when using many consumers.

The evaluation demonstrates the benefit of processing data continuously as it is generated, achieving a 23% gain with the right generation and process times and resources. Also, using streams as a control mechanism enabled the removal of synchronisation points when running several parallel algorithms, leading to a 33% gain when running more than 32 iterations. Finally, an in-depth analysis of the runtime's performance shows that there are no major differences regarding the task analysis time nor the task scheduling time when using streams or object tasks, and that the use of streams is recommended when the total size of the task parameters exceeds 48 MB and more than 12 objects are published to the stream.

Although the solution is fully functional, some improvements can be made. Regarding the DistroStream implementations, we plan to extend the FileDistroStream to support shared disks with different mount points. On the other hand, we will add new ObjectDistroStream backend implementations (apart from Kafka), so that users can choose between them without changing the application's code. Envisaging that a single DistroStream Server could become a bottleneck when managing several applications involving a large number of cores, we consider replacing the client-server architecture with a peer-to-peer approach. Finally, by highlighting the benefits of Hybrid Workflows, we expect to attract real-world applications to elevate our evaluation to more complex use cases.
Acknowledgements
This work has been supported by the Spanish Government (contracts SEV-2015-0493 and TIN2015-65316-P), by the Generalitat de Catalunya (contract 2014-SGR-1051), and by the European Commission through the Horizon 2020 Research and Innovation programme under contract 730929 (MF2C project). Cristian Ramon-Cortes' predoctoral contract is financed by the Spanish Government under contract BES-2016-076791.