Methods and Experiences for Developing Abstractions for Data-intensive, Scientific Applications
Andre Luckow and Shantenu Jha
Ludwig-Maximilian University, Munich, Germany; RADICAL, ECE, Rutgers University, Piscataway, NJ 08854, USA; Brookhaven National Laboratory, Upton, NY, USA
Abstract—Developing software for scientific applications that require the integration of diverse types of computing, instruments, and data presents challenges that are distinct from commercial software. These applications require scale and the need to integrate various programming and computational models with evolving and heterogeneous infrastructure. Pervasive and effective abstractions for distributed infrastructures are thus critical; however, the process of developing abstractions for scientific applications and infrastructures is not well understood. While theory-based approaches for system development are suited for well-defined, closed environments, they have severe limitations for designing abstractions for scientific systems and applications. The design science research (DSR) method provides the basis for designing practical systems that can handle real-world complexities at all levels. In contrast to theory-centric approaches, DSR emphasizes both practical relevance and knowledge creation by building and rigorously evaluating all artifacts. We show how DSR provides a well-defined framework for developing abstractions and middleware systems for distributed systems. Specifically, we address the critical problem of distributed resource management on heterogeneous infrastructure over a dynamic range of scales, a challenge that currently limits many scientific applications. We use the pilot-abstraction, a widely used resource management abstraction for high-performance, high-throughput, big data, and streaming applications, as a case study for evaluating the DSR activities. For this purpose, we analyze the research process and artifacts produced during the design and evaluation of the pilot-abstraction. We find that DSR provides a concise framework for iteratively designing and evaluating systems. Finally, we capture our experiences and formulate different lessons learned.
I. INTRODUCTION
New scientific applications and discoveries are enabled by advanced data and compute infrastructures, algorithms, and tools. Scientific progress increasingly depends on driving forward the ability to support the large-scale computational and data demands of simulations in conjunction with data processing, analytics, and machine learning [1], [2]. The complexity of developing, deploying, and scaling scientific applications arises from various sources, in particular the increasing heterogeneity that exists at all levels, from hardware, infrastructure, and middleware to software [3].
Abstractions are crucial for scalable systems: they hide internal complexities and expose simple interfaces [4], [5]. Designing useful abstractions is challenging: hiding complexity does not automatically lead to simple interfaces. The possible design space for abstractions is typically vast, and there is no consensus on what constitutes effective abstractions. Further, there are no accepted recipes for designing and developing abstractions for large-scale scientific distributed compute and data infrastructures.
The particular challenge addressed in this paper is the design of abstractions for resource management on distributed and heterogeneous compute and data infrastructure. Currently, the scale and uptake of scientific, data-intensive applications are hindered by a reliance on proprietary application- and systems-level resource management systems. These are often implemented using rigid and ad-hoc approaches [6]. A generalized abstraction that helps overcome these limitations and enables scalable applications is needed.
While formal approaches may be suitable for closed systems, they have limitations for designing open, scientific distributed systems. Iivari emphasizes that the "theory-with-practical-implications research strategy has seriously failed to produce results that are of real interest in practice [7]". DSR is an iterative approach to building, evaluating, and refining software systems.
While many research approaches focus solely on theory and knowledge, DSR emphasizes practical relevance. It recognizes that complex systems need to be designed and evaluated in real-world settings. By introducing a rigorous evaluation of the produced artifacts, DSR provides generalizable knowledge that informs future design iterations but can also be transferred to other problems.
We propose the application of the design science research (DSR) method [8] to the design of an abstraction and middleware for distributed resource management. Specifically, we apply the DSR method to the process of designing the pilot-abstraction [6]. Based on in-depth studies of different applications, we define the design objectives of the abstraction and system. Using the rigorous, iterative DSR process, we design, evaluate, and evolve the abstraction from a compute-centric to an integrated abstraction for managing compute and data resources and applications. In this paper, we demonstrate the suitability of DSR for creating well-defined abstractions and implement these in a real-world system.
As part of DSR, we define different evaluation methods and criteria for assessing the abstraction. For example, we investigate the usability and versatility of the abstraction in several case studies, e.g., in ensemble-based simulations, MapReduce, and stream processing applications. We use conceptual modeling to provide and validate our understanding of the pilot-abstraction and the underlying mechanisms.
[Figure 1 content: the build-evaluate-refine cycle draws application needs (characteristics, application patterns, data sources such as IoT and instruments, dependencies, performance requirements) and infrastructure context (HPC, HTC, cloud) from the environment, and applicable knowledge (parallel and distributed computing foundations, frameworks, conceptual modeling, software engineering methods, design patterns, architecture models, performance measurements and models) from the knowledge base. The design/develop activity targets abstractions for resource management on heterogeneous infrastructure for data-intensive applications, enabling performance and scale; predict/evaluate uses case studies, conceptual modeling, performance measurement and modeling, and the Mini-App framework; results are applied in the appropriate environment and added to the knowledge base.]
Fig. 1:
Design Science Research Method (adapted from Hevner [8]):
To address the complexity of the problem space, we follow an iterative research approach of continuously building and evaluating abstractions.

Further, we study different implementations of the abstraction concerning performance and scalability using different types of applications, e.g., from the domains of genome sequencing and the light source sciences.
This paper makes the following contributions: (i) it uses the DSR framework to assemble a set of methods for the design and evaluation of abstractions, (ii) it demonstrates the validity of DSR for designing and evaluating abstractions, such as the pilot-abstraction, (iii) it surveys publications related to the pilot-abstraction and investigates the methods used for design and evaluation, and (iv) it synthesizes the experiences gathered during this process in a set of lessons learned. DSR was initially introduced in the domain of information systems research; we believe this is its first application to scientific distributed computing.
This paper is structured as follows: We begin with an introduction of the methodology in Section II and continue with an investigation of scientific applications and their characteristics in Section III. The result is five application scenarios that the abstraction needs to address. We present the pilot-abstraction in Section IV. In Section V, we discuss the methods used for evaluating the system. We discuss our learnings and experiences of applying DSR in Section VI.
II. METHODOLOGY
The objective of this section is to provide an introduction to design science research (DSR) [8]. DSR avoids the limitations of theory-based approaches, in particular their inability to capture complex, real-world systems. It emphasizes the iterative creation, evaluation, and refinement of systems. The complexity of scientific applications and infrastructure makes DSR suitable for designing abstractions that enable applications to scale across heterogeneous infrastructure. For this purpose, we customize DSR and apply it for the first time to the problem of abstraction development (see Figure 1).
The build-evaluate-refine cycle has two primary inputs: The environment provides essential context for the problem, in particular concerning application requirements, characteristics, and infrastructure.
Fig. 2:
Application Analysis (adapted from [12]):
Patterns emerge by observing application characteristics and implementations.
The knowledge base defines, in particular, the foundations and methodologies used in the evaluation. In the following, we give an overview of the different design science research activities (adapted from Peffers [9]): (i) problem identification, (ii) definition of objectives, (iii) design and implementation, (iv) demonstration, and (v) evaluation.
A. Problem Identification and Objectives
Before starting the design process, an understanding of the problem and design objectives is essential. Common methods for this activity are literature reviews, expert interviews, focus groups, and surveys [10]. In scientific application development, requirements frequently emerge only during the creation of the system [11]. Thus, iterative methods, such as DSR, are instrumental. The challenge addressed by this paper is the design and development of effective abstractions that provide the right level of detail and support a variety of application scenarios and infrastructures while retaining ease-of-use.
B. Design
1) Abstraction Design:
Abstractions are a fundamental method of computer science, enabling reasoning about a problem at the right level while allowing the underlying system to implement a solution [4]. Shaw defines an abstraction as "a simplified description or specification of a system that emphasizes some of the system's details or properties while suppressing others [5]."
To develop efficient abstractions, an understanding of applications and infrastructure usage modes is instrumental. An important foundation for the development of abstractions are patterns. Patterns are suitable solutions to recurring problems in a particular context that can be applied multiple times without being applied in exactly the same way [13], [12]. Patterns can be discovered by observing common problem decompositions (e.g., task and data partitioning), communication, and coordination structures in applications. Jha et al. utilize this process to study patterns and abstractions for distributed applications [12]. Mattson et al. [13] investigate patterns for parallel and distributed applications. Figure 2 illustrates the relationship between applications, patterns, abstractions, and systems. Discovered patterns serve as candidates for the development of abstractions and their implementation in a middleware system.
An abstraction represents the external interface of the system. Thus, careful design is essential. The desired properties of an abstraction are generality and simplicity [14]. Generality refers to the ability of the abstraction to be broadly used. Simplicity is reflected in multiple properties, e.g., ease-of-use, maintainability, and extensibility [15].
The development of abstractions is a difficult task and requires the identification of essential concepts, properties, and relationships. Conceptual modeling enables abstract thinking and reasoning about a system and its abstraction. Conceptual models represent and describe systems, e.g., applications, systems, and infrastructure. They can be used to formulate concepts about the system, explaining how a system works. Mylopoulos defines a conceptual model as "a description of an aspect of the physical and social world around us for the purposes of understanding and communication [16]."
Johnson defines a conceptual model as "a high-level description of how a system is organized and operates [17]." Conceptual models were introduced to computer science in 1984 by Brodie, Mylopoulos, and Schmidt to overcome the increasing specialization of computer science disciplines and to better describe high-level aspects and interactions [18]. Conceptual models are used in different areas, e.g., for software architectures [19] and programming languages [5].
2) System Design:
Software architecture and engineering research is the study of useful system organization [20], i.e., the design of the composition/decomposition of systems and subsystems and the communication between these. Common objectives of system design are flexibility, maintainability, and re-usability [19]. Patterns are an important aspect of designing software systems. Initially introduced in the domain of building architecture [21], they were adopted for software architecture and engineering by Beck and Cunningham [22].
A fundamental principle of system design is modularization and decomposition. Modularization has many benefits, e.g., flexibility, comprehensibility, and maintainability [23]. Further, development time can be reduced by the ability to distribute work across different groups. According to Parnas, the most critical criterion for organizing a system is information hiding [24], i.e., the ability to carefully control the information exposed by a component using well-defined external interfaces and hiding information that is likely to change.
A common type of modularization used by complex systems is a layered architecture [25]. The layered architecture model partitions the system into distinct hierarchical layers. Each layer encapsulates a defined set of functions and provides services to the layer above. This pattern is widely used in system-level software, such as databases, operating systems, middleware, and distributed systems.
Similar to the layered architecture model is the hourglass model [14], which relies on a central bottleneck layer at the waist of the hourglass that connects a wide range of lower-level and higher-level services. Resource management is commonly described using hourglass models [26], [27].
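The layering and information-hiding principles above can be illustrated with a minimal sketch (all class and method names here are invented for illustration): an upper workload layer depends only on a narrow, stable interface of the execution layer below it, so the hidden backend details can change without affecting the layer above.

```python
from abc import ABC, abstractmethod

# Lower layer: a narrow, stable interface (information hiding, per Parnas).
# Callers depend only on submit(); how tasks are actually executed is a
# hidden detail that is free to change.
class ExecutionLayer(ABC):
    @abstractmethod
    def submit(self, command: str) -> str:
        """Run a task and return its output."""

class LocalExecutionLayer(ExecutionLayer):
    def submit(self, command: str) -> str:
        # Hidden detail: the execution mechanism (could be a batch
        # scheduler, a cloud API, ...). Here we just echo for illustration.
        return f"ran: {command}"

# Upper layer: builds its service on the layer below, never reaching
# around it to the hidden internals.
class WorkloadLayer:
    def __init__(self, exec_layer: ExecutionLayer):
        self._exec = exec_layer  # only the interface is visible

    def run_all(self, commands):
        return [self._exec.submit(c) for c in commands]

results = WorkloadLayer(LocalExecutionLayer()).run_all(["task-a", "task-b"])
print(results)  # ['ran: task-a', 'ran: task-b']
```

Swapping `LocalExecutionLayer` for another implementation of the same interface leaves `WorkloadLayer` untouched, which is the maintainability benefit the text describes.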
C. Evaluation
A rigorous evaluation of all artifacts is a crucial part of the DSR process. Sonnenberg and vom Brocke [10] propose four evaluation activities. Eval 1 evaluates the problem statement, using methods such as literature reviews and surveys. Eval 2 investigates the design specifications, e.g., using expert interviews, demonstrations, simulations, and benchmarking. Eval 3 is concerned with the instantiation of an artifact, e.g., a prototype, using methods such as experimentation. Last, Eval 4 observes the artifact in the real world. We utilize different evaluation methods for different activities, in particular case studies [28], performance characterization, and modeling [29]. In the following, we particularly focus on methods for performance and scalability evaluations, i.e., Eval 3 and 4.
1) Performance Characterization:
Performance measurements and characterizations are common methods for describing a system in artificial and natural settings (Eval 3/4). Performance measurements can have different objectives: (i) workload and system characterization, (ii) performance improvements, and (iii) the evaluation of design alternatives [30]. An essential component of a performance evaluation is the workload, defined as the set of all inputs (programs and data) that a system receives from its environment [30]. A benchmark refers to a workload that is used to compare computer systems. A workload used in performance evaluations should be representative; in the best case, it should reflect an actual, real-world workload.
Benchmarking refers to the process of comparing two or more computer systems [29]. A measure describes the performance of a system, e.g., the runtime or throughput of a system. More complex metrics can capture cost/price or quality/runtime trade-offs [31].
Scientific applications are complex, unique, and not well-represented by standard benchmarks. The chosen metrics often do not provide a comprehensive view of the system and thus are not a proxy for real-world performance [31]. Further, most benchmarks neglect application-level quality metrics and focus mostly on runtime and scaling performance. However, it is often challenging to obtain real-world performance data that provides useful insights. For example, data-intensive applications involve complex infrastructure components, such as data sources, brokers, and processing applications, that need to be carefully controlled. To account for this, simplified, synthetic workloads are often used to study performance (e.g., the Mini-App framework [32]). Similar techniques are commonly used for generating reproducible data and compute workloads; see [33], [34], [35].
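In this spirit, a reproducible synthetic workload can be generated by drawing task runtimes and data sizes from seeded distributions. The sketch below is illustrative only; the names, the exponential distribution, and the default parameters are our assumptions, not taken from the Mini-App framework.

```python
import random
from dataclasses import dataclass

@dataclass
class SyntheticTask:
    task_id: int
    runtime_s: float   # simulated compute time in seconds
    input_mb: float    # simulated input data size in MB

def make_workload(n_tasks, mean_runtime_s=10.0, mean_input_mb=64.0, seed=42):
    """Generate a reproducible synthetic workload. Exponential draws
    mimic the skewed task durations often observed in practice; the
    fixed seed makes every run of the experiment identical."""
    rng = random.Random(seed)  # private RNG -> no global-state interference
    return [SyntheticTask(i,
                          rng.expovariate(1.0 / mean_runtime_s),
                          rng.expovariate(1.0 / mean_input_mb))
            for i in range(n_tasks)]

workload = make_workload(1000)
print(len(workload), round(sum(t.runtime_s for t in workload) / len(workload), 1))
```

Because the generator is deterministic for a given seed, two benchmark runs receive byte-identical workloads, which is exactly the controllability that real application traces lack.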
2) Performance Models:
Performance models [36] are a way of abstracting performance-related insights into an analytical model. An analytical model is a precise formulation of a model using mathematical logic, entities, and relations to describe concepts [37]. Analytical models are white-box and can quantify the relationship between the different concepts. Statistical models, in contrast, derive insights and predictions from data [38]. The advantage of statistical models is that they do not require domain knowledge and can model highly complex domains. However, they are often black-box models, i.e., they are more difficult to interpret.
Many computer science domains use performance models, e.g., for programming languages, operating systems, database systems, system components (e.g., schedulers), as well as parallel and distributed systems. For example, database systems utilize cost-based optimizers to generate an optimal query execution plan [39]. A well-known performance model for distributed applications is Amdahl's Law [40].
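Amdahl's Law illustrates how such an analytical, white-box model works: for a fixed problem size, the speedup on n workers is bounded by the serial fraction of the program. A minimal sketch:

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Amdahl's Law: speedup = 1 / (serial + parallel/n).
    The serial fraction caps the achievable speedup at 1/serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# Even a 95%-parallel program gains less than 20x from 1024 workers:
print(round(amdahl_speedup(0.95, 1024), 1))  # 19.6
```

The model is white-box in the sense described above: each term (serial fraction, per-worker parallel share) maps to an interpretable system concept, unlike a fitted statistical model.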
III. APPLICATION SCENARIOS, CHARACTERISTICS AND REQUIREMENTS
Understanding the problem domain is an essential step in the design process. In this section, we discuss important application characteristics and requirements.
A. Application Characteristics
The requirements of scientific applications are growing more diverse and complex [41], [42]. An increasing number of instruments, such as light sources, telescopes, and genome sequencing machines, generate vast volumes of data. Applications are becoming more sophisticated and increasingly require the combination of various processing types, e.g., simulation, analytics, and machine learning. These processing types impose different requirements on abstractions, middleware, and infrastructure.
While the integration of different processing types is challenging, it yields many benefits; e.g., it has been demonstrated that machine learning-based approximation techniques can improve simulations (e.g., by quickly identifying regions of interest). Another example is the guidance of experiments using machine learning, e.g., to find interesting events and regions and to adapt sampling accordingly [41].
While data-intensive, scientific applications are highly diverse, they often share common computational and data characteristics. Early studies, e.g., the Berkeley Dwarfs [43], focused on the understanding of parallel algorithms based on their computation and data movement patterns. Jha et al. [12] study distributed applications. In D3 science [44], we conducted a survey consisting of 9 questions and a series of workshops to understand the distributed and dynamic data aspects of 13 scientific applications. The Big Data Ogres [45], [46] introduce a multi-dimensional framework of so-called facets, which represent key characteristics of big data applications, and use them to define a set of Mini-Apps based on a study of more than 50 use cases collected by NIST [47].
Based on an investigation of 50+ applications and their characteristics in [44], [45], [52], we derived five application scenarios: task-parallel, data-parallel, dataflow, iterative, and streaming (see Table I).
In the following, we discuss important characteristics and patterns found in these application scenarios.
An important characteristic is the decomposition pattern: Task-level parallelism describes the execution of diverse compute tasks on multiple compute resources. In contrast, data-parallelism creates tasks by partitioning the data. Abstractions such as MapReduce [56] enable the data-parallel processing and aggregation of data using high-level primitives. The runtime system then handles the implementation of the parallelism, i.e., the partitioning of the data, the mapping of data to tasks, and the orchestration and synchronization of tasks and data movements.
The dataflow model further generalizes the data-parallel model by supporting applications comprising multiple stages of processing. The abstraction is based on directed acyclic graphs, where nodes represent stages of processing and edges the flow of data between these stages. It was invented in the 1960s at MIT [57] and later adapted to the domain of data-intensive computing (LGDF2 [58], Dryad [59]) as a way to describe data processing pipelines comprising multiple stages, e.g., map, reduce, and shuffling. A stage can also comprise an external application (e.g., a simulation).
Iterative computation is a scenario applicable in particular to model training in machine learning applications. An important requirement of these types of applications is the need to cache data to facilitate reading and processing data multiple times [60]. This pattern is often found in machine learning applications, as many optimization techniques require multiple passes over the data to compute and update model parameters.
The last scenario is stream processing, defined as the ability to process unbounded data feeds and provide near-realtime insights [61]. The processing patterns for streaming are similar. However, the amount of data is typically smaller, as messages are processed in small batches.
The management of state between individual messages can be required. Stream processing is used to analyze data streams from scientific experiments, e.g., in the light source sciences [32].
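The two basic decomposition patterns above can be contrasted in a short sketch. The `simulate` and `analyze` functions are trivial stand-ins for real tasks: task-level parallelism submits heterogeneous, independent tasks, while data parallelism derives its tasks by partitioning the input and aggregating the mapped results (the map and reduce phases).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def simulate(x):  # stand-in for a simulation task
    return x * x

def analyze(x):   # stand-in for an analysis task
    return x + 1

# Task-level parallelism: diverse, independent tasks run concurrently.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(simulate, 3), pool.submit(analyze, 3)]
    task_results = [f.result() for f in futures]

# Data parallelism (MapReduce style): tasks are created by partitioning
# the data; the runtime maps partitions to workers.
data = range(8)
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(simulate, data))      # map phase
total = reduce(lambda a, b: a + b, mapped)       # reduce/aggregate phase

print(task_results, total)  # [9, 4] 140
```

Note how, in the data-parallel case, the application states only *what* to compute per partition; partitioning and worker assignment are handled by the runtime, mirroring the division of responsibility described in the text.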
B. Application Requirements
To support these scenarios, an efficient resource management abstraction and middleware that can support highly diverse task-based workloads is required. The heterogeneity of the tasks in these application scenarios is high, i.e., tasks with diverse runtime, resource, and data requirements often need to be efficiently managed. For example, complex scenarios require the management of both long-running tasks, e.g., simulation tasks, and short-running tasks, e.g., data-parallel tasks arising from analytics applications. Further, data-intensive applications can be highly unpredictable due to data dependencies, resulting in very complex task graphs. The necessity to respond to dynamic events demands support for task creation at runtime. The requirements for the abstraction and middleware can be summarized as follows:

R1 Abstractions: Provide a higher-level abstraction that hides the details of complex distributed infrastructure but allows reasoning about trade-offs. The abstraction should be simple and easy-to-use while supporting as many application scenarios as possible (applicable). Further, it should be generalizable to multiple systems and implementations.

R2 Middleware for Application-Level Resource Management: Provide the ability to manage highly diverse parallel and dependent tasks and associated data on heterogeneous infrastructure comprising complex hardware and software stacks. The system should support the interoperable use of heterogeneous infrastructures, in particular high performance computing (HPC), high throughput computing (HTC), and cloud infrastructures, and should be extensible to new frameworks and applications.

R3 Dynamism and Adaptivity: Ability to respond to changes in the environment at runtime. Both middleware and abstraction need to support this capability.

R4 Performance, Scalability, and Efficiency: The system should provide adequate performance, mainly high throughput and low latencies, for highly diverse task-based workloads. By doing so, the system supports the strong and weak scaling of applications while ensuring efficient resource usage.

TABLE I: Data-Intensive Application Scenarios – Characteristics and Patterns: Data-intensive applications are more complex than compute-oriented applications and require the management of data, I/O, and compute resources.

Task-Parallel – Description: Focus on functional decomposition into tasks and control flow. Characteristics: Decomposition of a problem into a diverse set of dependent and parallel tasks. Application examples: Molecular Dynamics [48], [49], Ensemble-Kalman Filter [50], Scientific Gateways and Workflows [51].

Data-Parallel/MapReduce – Description: Decomposition based on data with minimal communication between tasks. Characteristics: Embarrassingly parallel, loosely-coupled with minimal communication; details such as communication and synchronization are hidden from the application. Application examples: Map-only analytics [52], molecular data analysis (Hausdorff distance) [53].

Dataflow – Description: Multiple processing stages modeled with a directed acyclic graph. Characteristics: Multiple stages, loosely-coupled parallelism, global communication for the shuffle operation. Application examples: MapReduce for sequence alignment [54], molecular data analysis (leaflet finder and RMSD) [53].

Iterative – Description: Multiple generations of tasks with sharing of data between the generations. Characteristics: Loosely coupled parallelism with global communication for updating machine learning model parameters. Application examples: Machine learning algorithms, K-Means [55].

Streaming – Description: Processing of unbounded data feeds in near-realtime. Characteristics: Data is processed in small batches, often using data-parallel algorithms; for many algorithms, a global state needs to be maintained across batches of data. Application examples: Streaming for light source data [32].

IV. PILOT-ABSTRACTION: AN ABSTRACTION AND MODEL FOR DISTRIBUTED RESOURCE MANAGEMENT
The need for tools and high-level abstractions to support application development and the extreme heterogeneity of infrastructures has been widely recognized [12]. Resource management is a fundamental challenge in distributed and parallel computing. The current state is characterized by highly heterogeneous and fragmented systems, rigid point solutions, and a lack of a unified model for expressing data and compute tasks. The advent of data-intensive, machine learning, and streaming applications has complicated the state even further. Infrastructure is becoming more complicated with the introduction of new storage and memory tiers as well as accelerators.
Data-intensive applications exhibit complex characteristics and demand a highly flexible abstraction for allocating resources and managing highly diverse workloads of tasks. Balancing application characteristics and infrastructure requires careful consideration of application-level and infrastructure-level concerns. High-level abstractions are critical to retain developer productivity and to scale applications. For example, the ability to manage resources efficiently while taking into account application objectives is important. By infusing application knowledge (e.g., about the data, compute, and I/O characteristics) into scheduling decisions, the runtime and scalability can be significantly improved [62]. Thus, data needs to be integrated into these abstractions as a first-class citizen.
A. Pilot-Abstraction and Conceptual Model
The pilot-abstraction [6] is a unified abstraction for resource management on heterogeneous infrastructure, from high-performance computing, high-throughput computing, and big data to cloud infrastructures, for distributed applications. In the following, we discuss the experience of developing and extending the abstraction and the underlying middleware systems to support the described application scenarios. In this section, we focus on the DSR activities and artifacts related to the design and creation of the abstraction and middleware system.
We follow the iterative design approach of DSR, closely aligning the abstraction design to real application needs. The first system focused on the design of an application-internal resource management framework for replica-exchange simulations [48]. The resource management capabilities were evaluated using different application scenarios as case studies and performance measurements on different HPC infrastructures, focusing on the internal resource management subsystem. Based on the positive evaluation, we generalized the abstraction and created a re-usable, application-agnostic, standalone pilot-job system called BigJob [63].
The observation of similar concepts for other infrastructures and applications [64] motivated the development of the pilot-abstraction [6], a general abstraction for the re-occurring concept of utilizing a placeholder job as a container for a set of computational tasks. The abstraction comprises two main concepts: the pilot, which represents a placeholder for a specific set of resources, and the compute-unit, a self-contained task. The implementation of the pilot-job system conceals details about the resource management systems of the different infrastructures (e.g., HPC, HTC, and clouds).
Thus, the user can focus on the composition of tasks rather than dealing with infrastructure-specific aspects.
The pilot-abstraction addresses the need to efficiently and flexibly manage resources at the application level across distributed, heterogeneous infrastructure. Pilot-jobs provide two key capabilities: (i) they support the late binding of resources and workloads, and (ii) they provide a higher-level abstraction for the specification of application workloads, removing the need to manage the execution of the workload manually. At the same time, they provide critical capabilities to compose task- and data-parallel workloads while providing optimal scalability and performance by managing task granularities, data dependencies, and I/O via the abstraction. The design of the pilot-abstraction aims to offer an interface to these capabilities that is as simple and as general as possible.
[Figure 3 content – Problem statement: complexity of infrastructure and applications prevents scalability and scientific progress (Eval 1 criteria: importance, applicability). Design: the pilot-abstraction provides common resource management for different types of applications across different infrastructures (Eval 2 criteria: simplicity, generality, applicability). Construct: pilot-based middleware for distributed and dynamic data (Eval 3 criteria: feasibility, generality, extensibility, interoperability). Use: performance characterizations, application case studies, Mini-Apps, and performance modeling (Eval 4 criteria: applicability, runtime, speedup, throughput, generality).]
Fig. 3:
DSR Evaluation Activities and Criteria (adapted from Sonnenberg [10]):
The incremental evaluation provides valuable input for refinement and knowledge that can be transferred to other problems.
Another artifact of the design process is a conceptual model for understanding pilot-based systems. The P* model [6] aims to provide a common framework for understanding the abstraction as well as commonly used pilot-job systems. The P* model defines the high-level concepts and mechanisms found in most pilot-systems. The model specification is done using a functional description of the components and interactions, as well as various component and interaction diagrams. The characteristics and interactions of all concepts and an analysis of different pilot-systems using the model are available in [6]. Turilli et al. [65] refined the framework.
While the pilot-job concept was developed for HPC and HTC, the need to manage data in conjunction with pilots and tasks became apparent. Pilot-Data [66] extends the pilot-abstraction and provides the ability to manage storage and data and to couple these effectively with computational tasks. With the emergence of big data frameworks, such as Hadoop, Spark, and Dask, the ability to couple HPC applications to specialized data processing engines became increasingly important, which led to the development of Pilot-Hadoop [67], [68]. Further, extensions for in-memory processing [68] and streaming [32] have been designed and implemented.
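The two core concepts of the abstraction, the pilot and the compute-unit, can be sketched schematically as follows. The class and method names below are invented for illustration and do not reproduce the actual Pilot-API; the sketch only shows the idea of late binding: one resource request (the pilot) absorbs many tasks without further scheduler round-trips.

```python
from queue import Queue

class Pilot:
    """Schematic placeholder job: holds a fixed set of resources;
    compute-units are late-bound to it after the resources exist."""
    def __init__(self, cores: int):
        self.cores = cores
        self._queue = Queue()  # compute-units waiting for resources

    def submit_compute_unit(self, executable, *args):
        # Late binding: submitting a task does not trigger a new
        # resource request against the infrastructure's scheduler.
        self._queue.put((executable, args))

    def run(self):
        # Sketch only: dispatch queued units sequentially; a real
        # pilot system schedules them across the pilot's resources.
        results = []
        while not self._queue.empty():
            fn, args = self._queue.get()
            results.append(fn(*args))
        return results

pilot = Pilot(cores=16)          # one resource request ("pilot job")
for i in range(4):               # many tasks, no per-task scheduler wait
    pilot.submit_compute_unit(pow, i, 2)
print(pilot.run())  # [0, 1, 4, 9]
```

The user-facing surface is exactly the two concepts the P* model names (pilot, compute-unit); everything about how the placeholder resources are obtained stays behind the interface, which is the information hiding the abstraction is designed to provide.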
B. Middleware: System Design and Architecture
The objective of the system design phase is to create a system design and implementation that can support the desired abstractions. We applied the methods and practices described earlier to achieve a flexible, maintainable, and comprehensible architecture. The system architecture is based on well-known design patterns [69], e.g., the adapter pattern for abstracting specific resource types, i.e., HPC, cloud, and data infrastructures such as Hadoop and Spark. For some of these infrastructures, we utilize SAGA [70] as an access layer for local resource management systems. The design artifacts of the architecture model are created using block diagrams inspired by UML [71] to visualize system layers, composition, and interactions. Examples of architectural model artifacts can be found in Pilot-Job [63], Pilot-Data [66], Pilot-Hadoop [67], and Pilot-Streaming [32].

Fig. 4: Understanding the DSR Artifacts using Different Methods: Modeling techniques for characterization and evaluation of the pilot-abstraction and system. The conceptual model (the P* model) defines core concepts, their relationships, and interactions, exposed to applications via the Pilot-API; the architecture model describes the structure of the system; analytical and statistical performance models support reasoning about and prediction of the performance of the pilot-system and applications.

V. EVALUATION
Evaluation is an essential part of the DSR process and ensures that the designed system achieves the desired purpose. We explain the distinct types of evaluation conducted on the different artifacts produced throughout the Eval 1-4 activities proposed by Sonnenberg [10]. Figure 3 illustrates the four main activities: problem statement, design, construction, and use. We discuss in depth the methods and criteria used for evaluating the output of every stage.

Table II summarizes the evaluation methods used for the different DSR activities and the evolution of the pilot-abstraction. As proposed in [10], we evaluate the system interior, i.e., the architecture, as well as the exterior, i.e., the usage of the abstraction and system.

Figure 4 summarizes the modeling methods used. We use conceptual modeling to provide high-level intuition and to allow reasoning about inevitable trade-offs. Architecture models enable the evaluation of the internal structure of the systems. Performance models are used to describe the dynamic properties while using the abstraction and system. Insights from the conducted evaluations inform the abstraction design and provide generalizable knowledge. In the following, we discuss the applied methods and criteria in detail.
A. Problem Identification and Design Evaluation (Eval 1/2)
The Eval 1 activity, i.e., the justification of the problem statement and research gap, has been performed in the introduction and Section III. The results of the literature and application survey define the design objectives for the pilot-abstraction. The main criteria applied for the evaluation of the problem were the importance and applicability of the design idea to a broad set of applications.

The design of the pilot-abstraction and middleware system is evaluated according to three main criteria: simplicity, generality, and applicability. An important artifact of the design phase is the P* conceptual model. The model defines the elements, characteristics, and interactions. The objective of P* is to provide a minimal but complete model that conveys an intuition of the system. A metric for the simplicity of the model is its number of elements, which is very low with four main concepts. The design of the pilot-abstraction significantly reduces the amount of code necessary while providing interoperability across different infrastructures. Further, we demonstrate the model's generality by comparing and mapping different implementations of the pilot-abstraction [6].

TABLE II: Evaluation: Overview of Case Studies, Modeling Approaches, and Performance Evaluation Methods Used.

Pilot-Job [63], [6]. Description: management of computational tasks on heterogeneous infrastructure. Infrastructure: HPC, HTC, cloud. System design (Eval 2): conceptual model [6], architecture model [63]. Performance, scalability, and efficiency (Eval 3): pilot overhead, application and task runtimes, strong scaling, analytical model for replica-exchange simulations [72]. Case studies (Eval 4): Adaptive Replica Exchange [48], [72], Ensemble Kalman Filter simulations [50], HIV binding [49], science portals [51], Pilot-MapReduce [54].

Pilot-Data [66]. Description: management of data and compute tasks. Infrastructure: HPC, cloud, Hadoop/YARN. System design (Eval 2): conceptual model [6], architecture model [66]. Performance (Eval 3): pilot overhead, application and task runtimes, strong scaling. Case studies (Eval 4): genome sequencing, K-Means [66], [55].

Pilot-Hadoop [67]. Description: management of Hadoop and Spark. Infrastructure: HPC, cloud, Hadoop/YARN. System design (Eval 2): architecture model [67]. Performance (Eval 3): runtime, strong scaling. Case studies (Eval 4): Wordcount, K-Means.

Pilot-Memory [68]. Description: management of in-memory runtimes for iterative tasks. Infrastructure: HPC, cloud, Hadoop/YARN. System design (Eval 2): architecture model [68]. Performance (Eval 3): runtime, strong scaling. Case studies (Eval 4): K-Means.

Pilot-Streaming [32]. Description: streaming data sources and processing. Infrastructure: HPC, cloud, serverless. System design (Eval 2): architecture model [32]. Performance (Eval 3): throughput, latency, scalability, statistical performance model for throughput [73]. Case studies (Eval 4): light source data reconstruction, K-Means.
B. System Implementation (Eval 3)
The Eval 3 activity evaluates the pilot-abstraction in artificial settings. The developed conceptual models provide an important basis for the construction of the system and the performance evaluation by offering essential information about the structure and expected behavior of the system. The prototype implementation of the pilot-system is evaluated using an architecture model comprising several component and interaction diagrams. The main criteria are feasibility, extensibility, and interoperability.

The feasibility and generality of the abstraction are shown in various prototype and production implementations [63], [74]. Various extensions, e.g., for data management, in-memory processing, and support for new infrastructures, such as cloud and serverless, demonstrate the extensibility of the system. The implementation maps the pilot-abstraction to the different infrastructures, enabling interoperability. We verified interoperability through experiments with a broad set of distributed HPC and data-intensive applications.
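The mapping of the single pilot-abstraction onto different infrastructures follows an adapter-style design, as described in the system design section. The following is a minimal, hypothetical sketch: the adapter classes and the generated commands are illustrative, not the actual SAGA or pilot-system code.

```python
# Sketch of the adapter idea: one abstract pilot interface, with
# per-infrastructure adapters translating it into backend-specific
# submissions.  All names and commands here are illustrative.
from abc import ABC, abstractmethod

class ResourceAdapter(ABC):
    """Uniform interface the pilot-manager programs against."""

    @abstractmethod
    def submit_pilot(self, cores: int) -> str: ...

class SlurmAdapter(ResourceAdapter):
    def submit_pilot(self, cores):
        # Would generate and submit a batch script on an HPC machine.
        return f"sbatch --ntasks={cores} pilot_agent.sh"

class YarnAdapter(ResourceAdapter):
    def submit_pilot(self, cores):
        # Would request a YARN container of the given size.
        return f"yarn application -launch pilot_agent vcores={cores}"

def start_pilot(adapter: ResourceAdapter, cores: int) -> str:
    # Application-level code is identical for every infrastructure;
    # only the injected adapter differs.
    return adapter.submit_pilot(cores)

print(start_pilot(SlurmAdapter(), 64))
print(start_pilot(YarnAdapter(), 64))
```

Interoperability follows from the fact that `start_pilot` is backend-agnostic: adding a new infrastructure means adding one adapter, not changing application code.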
C. Performance and Case Studies (Eval 4)
An important objective of the pilot-abstraction is to overcome barriers to scaling. Thus, performance and scalability are essential evaluation criteria, as both are instrumental for many scientific applications. We use three approaches: (i) performance characterization of the pilot-system and several applications, (ii) analytical performance modeling, and (iii) statistical performance modeling for selected use cases.

As standard benchmarks do not adequately reflect the requirements of scientific applications, we rely on custom experiments for evaluations. A challenge for performance characterizations and modeling is the experimental design and data collection. Experimental design is the process of determining the factors, factor levels, and combinations of these for an experiment so as to understand the effect of each factor while minimizing the number of experiments [29], [75]. A good experimental design is essential to capture essential characteristics while minimizing data collection efforts.

We propose the Mini App framework [32] to address these challenges and to automate and accelerate the build-assess-refine cycle. The Mini App framework helps to evaluate abstractions, middleware, and infrastructure under real-world conditions. Further, the collected data can serve as a basis for statistical models and predictions. It was designed to support sound experimental design following best practices defined by Gray [31] and Waller [76]: (i) Simplicity: easy to use and set up via high-level APIs and configurations. (ii) Relevance: it gives the developer full control of the application workload and the metrics necessary for the application scenario. (iii) Scalability: support for distributed resources and datasets at various scale levels and data rates. (iv) Portability: infrastructure- and application-agnostic by design; different types of infrastructure are supported via the pilot-abstraction. (v) Reproducibility: it provides comprehensive automation of performance experiments, ensuring repeatability and reproducibility.

Another important aspect of DSR is the ability to derive knowledge and insights. We use different modeling approaches to generalize abstractions, systems, and applications. For example, we provide analytical models for the performance of the applications and pilot-systems [72], [66]. These models capture the significant components of the runtime and allow users to understand the impact of input data volume and parallelism on the runtime. Further, they enable the assessment of the system overheads and their ratio to the overall runtime of the application. In addition, we use statistical modeling, e.g., for the prediction of the throughput of streaming systems for different infrastructure configurations [73].

Finally, we evaluate the applicability of the abstraction in natural settings, e.g., in various applications [50], [77] and frameworks [54]. In these investigations, we assess whether the pilot-abstraction meets the defined requirements concerning its capabilities, simplicity, and the feasibility to implement, deploy, and execute applications. In particular, we focus on resource management requirements, such as the ability to adapt to changing resource needs while providing adequate performance and scalability. The abstraction proved useful for capturing the critical parameters necessary to express task and data decompositions and the associated performance trade-offs. In various case studies, we demonstrated that the abstraction allows suitable control of compute and data movements.

VI. DISCUSSION
Abstractions are vital for handling complexity and building systems at an unprecedented scale. We present a balanced approach using the design science research method to design and evaluate the pilot-abstraction, an abstraction for enabling resource management across heterogeneous, distributed resources. By iteratively addressing real-world application and system challenges using DSR as a methodological framework, we were able to develop and refine the pilot-abstraction. The incremental evaluation of the artifacts of the DSR process provides valuable input for future iterations and generalizable knowledge for similar problems. Using DSR, we designed and developed the pilot-abstraction and evaluated it against the defined requirements:

R1 Abstractions: The pilot-abstraction's capabilities and simplicity have been evaluated and validated in several application scenarios, e.g., ensemble simulations, data-intensive applications, and streaming. Further, the extensive usage of the pilot-abstraction for higher-level building blocks, e.g., a workflow framework [78], an ensemble simulations management framework, and a MapReduce framework [54], demonstrates its viability and usefulness.

R2 Middleware for Application-Level Resource Management: The pilot-system provides interoperable use of HPC, cloud, and data infrastructures. In [79], we explore the interoperable use of HPC, HTC, and clouds. In [66], we use and characterize the use of Pilot-Data on HPC and HTC resources. The system is extensible to new infrastructures, such as Hadoop [67], streaming [32], and serverless [73].

R3 Dynamism and Adaptivity: An important capability of the pilot-abstraction is the ability to respond to changes in the environment at runtime. In [63], we explore the usage of additional cloud resources at runtime to meet application demands. In [73], we demonstrated a model for throughput prediction to determine the optimal set of resources for a given workload.

R4 Performance, Scalability, and Efficiency: We demonstrated in various studies that the pilot-abstraction enables the creation of scalable applications by giving fine-grained control over data/task composition while hiding the details [63], [66], [53], [73].

In the following, we describe and synthesize our experiences from the development of the pilot-abstraction in a set of lessons learned to inform the design process of future systems.
Iteration:
The iterative design and evaluation process of DSR is instrumental in creating appropriate abstractions and middleware systems. Building real systems and applications is instrumental in discovering new usage modes and further requirements, and implementing smaller working systems first is instrumental before scaling to more extensive resources and further applications. Specifically, we iteratively grew the pilot-job system from supporting coarse-grained ensembles of simulation tasks on single infrastructures to supporting high-volume, fine-grained data-parallel tasks and streaming.

Fig. 5: Iterative Research Approach: Using an iterative feedback loop of abstraction design and evaluation, with real-world case study applications and the synthetic Mini-App framework used to assess and refine the abstractions, the pilot-system, and the performance models, informed by application requirements and the research knowledge base.
Automation:
Collecting data on the design is an instrumental part of the process. Automating experiments for performance characterizations and measurements is important to enable the exploration of larger parameter spaces and to ensure reproducibility. We developed the Mini App framework to formalize and automate the experiments and data collection. Figure 5 illustrates the feedback loop used for the design of the pilot-abstraction and its implementation in the pilot-system. Through continuous evaluations, partially automated with the Mini App framework, valuable input for the abstraction design, experimental design, and modeling process is generated.
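Such automation can be sketched as the enumeration and execution of a full-factorial design over factors and factor levels [29]. The factor names and the measured quantity below are illustrative stand-ins, not the actual Mini App framework API.

```python
# Sketch: automating a performance experiment by enumerating all
# combinations of factors and factor levels (full-factorial design).
# The factors and the measured "runtime" here are illustrative.
from itertools import product

factors = {
    "nodes": [1, 2, 4],
    "tasks_per_node": [8, 16],
}

def run_experiment(nodes, tasks_per_node):
    # Stand-in for launching one configuration and measuring its runtime;
    # a real harness would submit the workload and collect metrics.
    return 100.0 / (nodes * tasks_per_node)

# Enumerate the cross-product of all factor levels: 3 x 2 = 6 runs.
plan = [dict(zip(factors, levels)) for levels in product(*factors.values())]
results = [{**cfg, "runtime": run_experiment(**cfg)} for cfg in plan]

print(len(results))
```

Encoding the plan as data makes the experiment repeatable and makes the collected results directly usable as input for statistical models.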
Abstraction Design:
The design process is complex and requires carefully trading off capabilities, simplicity, and generality. The more application-specific knowledge can be induced via abstractions into middleware systems, the better the decisions the system can make, e.g., concerning scheduling. However, the more application-specific the abstraction, the less general its utility. Balancing simplicity, generality, and capability is challenging and requires a careful evaluation of the abstraction in different applications and settings.
Compute and Data:
Managing heterogeneous compute tasks at scale is challenging by itself. The addition of data complicates the problem significantly. There is a significant amount of heterogeneity and dynamism in the way data can be stored, transferred, and used. Typically, a great extent of the data lifecycle is external to the applications. We address these challenges particularly by focusing on defined application scenarios (see Table I) and by supporting and optimizing for important patterns, e.g., MapReduce.
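The MapReduce pattern referred to above decomposes into independent map tasks, a group-by-key shuffle, and reduce tasks. The following is a minimal, framework-free sketch; in a pilot-based system, the map and reduce functions would run as tasks within a pilot rather than inline.

```python
# Minimal, framework-free sketch of the MapReduce pattern:
# independent map tasks, a shuffle (group-by-key), and a reduce phase.
from collections import defaultdict

def map_phase(chunk):
    # Emit (word, 1) pairs for one input partition.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["to be or", "not to be"]
pairs = [p for chunk in chunks for p in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts["to"], counts["be"])  # 2 2
```

Because the map and reduce steps are independent per partition and per key, the pattern exposes exactly the task and data decomposition that a resource manager can schedule and optimize.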
Optimize Application Algorithms:
A universal abstraction and system for resource management can help to scale applications by simplifying and standardizing the process of resource and task management. In many cases, however, improving the algorithm itself can lead to even more significant improvements than scaling out a non-optimal algorithm to more resources (see, e.g., [53]).
Limitations of Abstractions:
In many cases, systems are not limited by the conceptual abstraction, but by the implementation of the system and the infrastructure. Further, abstractions can exhibit undesirable behaviors. Leaky abstractions describe the phenomenon that abstractions frequently fail in real-world settings, exposing complexities of the underlying systems that they were meant to abstract [80].
Re-Use and Interoperability:
A well-designed abstraction is a minimal requirement for developing robust and scalable software systems. By abstracting commonalities between systems, interoperability can be achieved. However, significant investments into the stability and robustness of the system are required to support real-world applications.

REFERENCES

[1] Anthony J. G. Hey, Stewart Tansley, and Kristin M. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft, 2009.
[2] Geoffrey Fox, Judy Qiu, David Crandall, Gregor Von Laszewski, Oliver Beckstein, John Paden, Ioannis Paraskevakos, Shantenu Jha, Fusheng Wang, Madhav Marathe, Anil Vullikanti, and Thomas Cheatham. Contributions to high-performance big data computing. In L. Grandinetti, G. R. Joubert, K. Michielsen, S. L. Mirtaheri, M. Taufer, and R. Yokota, editors, Future Trends of HPC in a Disruptive Scenario. IOS Press, Advances in Parallel Computing, volume 34, 2019.
[3] Jeffrey S. Vetter, Ron Brightwell, Maya Gokhale, ..., and Jeremiah Wilke. Extreme heterogeneity 2018 - productive computational science in the era of extreme heterogeneity: Report for DOE ASCR workshop on extreme heterogeneity.
[4] Alfred V. Aho and Jeffrey D. Ullman. Foundations of Computer Science. Computer Science Press, Inc., USA, 1992.
[5] Mary Shaw. The impact of modelling and abstraction concerns on modern programming languages. Chapter in On Conceptual Modelling: Perspectives from Artificial Intelligence, Databases, and Programming Languages, Topics in Information Systems [18], 1984.
[6] Andre Luckow, Mark Santcroos, Andre Merzky, Ole Weidner, Pradeep Mantha, and Shantenu Jha. P*: A model of pilot-abstractions. IEEE 8th International Conference on e-Science, pages 1–10, 2012. http://dx.doi.org/10.1109/eScience.2012.6404423.
[7] Juhani Iivari. A paradigmatic analysis of information systems as a design science. Scandinavian Journal of Information Systems, 19:39–, January 2007. https://aisel.aisnet.org/sjis/vol19/iss2/5/.
[8] Alan R. Hevner, Salvatore T. March, Jinsoo Park, and Sudha Ram. Design science in information systems research. MIS Quarterly, 28(1):75–105, 2004.
[9] Ken Peffers, Tuure Tuunanen, Marcus Rothenberger, and Samir Chatterjee. A design science research methodology for information systems research. J. Manage. Inf. Syst., 24(3):45–77, December 2007.
[10] Christian Sonnenberg and Jan vom Brocke. Evaluations in the science of the artificial: Reconsidering the build-evaluate pattern in design science research. In Ken Peffers, Marcus Rothenberger, and Bill Kuechler, editors, Design Science Research in Information Systems: Advances in Theory and Practice, pages 381–397. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[11] Judith Segal. Models of scientific software development. In SECSE 08, First International Workshop on Software Engineering in Computational Science and Engineering, May 2008. Workshop co-located with ICSE 08, http://icse08.upb.de/.
[12] Shantenu Jha, Murray Cole, Daniel S. Katz, Manish Parashar, Omer Rana, and Jon Weissman. Distributed computing practice for large-scale science and engineering applications. Concurrency and Computation: Practice and Experience, 25(11):1559–1585.
[13] Timothy Mattson, Beverly Sanders, and Berna Massingill. Patterns for Parallel Programming. Addison-Wesley Professional, first edition, 2004.
[14] Micah Beck. On the hourglass model. Commun. ACM, 62(7):48–57, June 2019.
[15] Joshua Bloch. How to design a good API and why it matters. In Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications.
[16] ∼jm/2507S/Readings/CM+Telos.pdf, 1992.
[17] Jeff Johnson and Austin Henderson. Conceptual models: Begin by designing what to design. Interactions, 9(1):25–32, January 2002.
[18] Michael L. Brodie, John Mylopoulos, and Joachim W. Schmidt. On Conceptual Modelling: Perspectives from Artificial Intelligence, Databases, and Programming Languages. Topics in Information Systems. Springer New York, 1984.
[19] Bernd Bruegge and Allen H. Dutoit. Object-Oriented Software Engineering Using UML, Patterns, and Java. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009.
[20] M. Shaw. The coming-of-age of software architecture research. In Proceedings of the 23rd International Conference on Software Engineering (ICSE 2001), pages 657–664a, May 2001.
[21] C. Alexander, S. Ishikawa, M. Silverstein, M. Jacobson, I. Fiksdahl-King, and S. Angel. A Pattern Language: Towns, Buildings, Construction. Center for Environmental Structure series. OUP USA, 1977.
[22] Kent Beck and Ward Cunningham. Using pattern languages for object-oriented programs. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 1987.
[23] David Lorge Parnas. On the criteria to be used in decomposing systems into modules. Commun. ACM, 15(12):1053–1058, December 1972.
[24] David Lorge Parnas. Information distribution aspects of design methodology. Methods, 4(5):6–7, 1971.
[25] David Garlan and Mary Shaw. An introduction to software architecture. Technical report, Pittsburgh, PA, USA, 1994.
[26] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Elsevier Science, 2003.
[27] Judy Qiu, Shantenu Jha, Andre Luckow, and Geoffrey C. Fox. Towards HPC-ABDS: An initial high-performance big data stack. In Proceedings of ACM Big Data Interoperability Framework Workshop, 2015.
[28] Kathleen M. Eisenhardt. Building theories from case study research. The Academy of Management Review, 14(4):532–550, 1989.
[29] Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley Professional Computing. Wiley, 1991.
[30] D. Ferrari. Computer Systems Performance Evaluation. Prentice-Hall, 1978.
[31] Jim Gray. Benchmark Handbook: For Database and Transaction Processing Systems. Morgan Kaufmann, San Francisco, CA, USA, 1992.
[32] André Luckow, George Chantzialexiou, and Shantenu Jha. Pilot-Streaming: A stream processing framework for high-performance computing. IEEE eScience International Conference, abs/1801.08648, 2018.
[33] W. Buchholz. A synthetic job for measuring system performance. IBM Systems Journal, 8(4):309–318, 1969.
[34] J. W. Anderson, K. E. Kennedy, L. B. Ngo, A. Luckow, and A. W. Apon. Synthetic data generation for the internet of things. 2014.
[35] Andre Merzky, Ming Tai Ha, Matteo Turilli, and Shantenu Jha. Synapse: Synthetic application profiler and emulator. Journal of Computational Science, 27:329–344, 2018.
[36] Adolfy Hoisie. Performance modeling overview. Talk at PAM 2018: Performance Analysis and Modeling Workshop. https://indico.bnl.gov/event/3950/contributions/12021/attachments/10817/13215/Talk at the Perf Workshop Feb 2018.pdf, 2018.
[37] A. Borgida, J. Mylopoulos, and H. K. T. Wong. Generalization/specialization as a basis for software specification. Chapter in On Conceptual Modelling: Perspectives from Artificial Intelligence, Databases, and Programming Languages, Topics in Information Systems [18], 1984.
[38] Danilo Bzdok, Naomi Altman, and Martin Krzywinski. Statistics versus machine learning. Nature Methods, 15:233, April 2018.
[39] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD '79, New York, NY, USA, 1979. ACM.
[40] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM.
[41] Geoffrey C. Fox, James A. Glazier, J. C. S. Kadupitiya, Vikram Jadhao, Minje Kim, Judy Qiu, James P. Sluka, Endre T. Somogyi, Madhav Marathe, Abhijin Adiga, Jiangzhuo Chen, Oliver Beckstein, and Shantenu Jha. Learning everywhere: Pervasive machine learning for effective high-performance computation. CoRR, abs/1902.10810, 2019.
[42] National Academies of Sciences, Engineering, and Medicine. Future Directions for NSF Advanced Computing Infrastructure to Support U.S. Science and Engineering in 2017-2020. The National Academies Press, Washington, DC, 2016.
[43] Krste Asanović, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.
[44] Shantenu Jha, Daniel S. Katz, Andre Luckow, Neil Chue Hong, Omer Rana, and Yogesh Simmhan. Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure. Concurrency and Computation: Practice and Experience, 29(8), 2017.
[45] Geoffrey C. Fox, Shantenu Jha, Judy Qiu, and Andre Luckow. Towards an understanding of facets and exemplars of big data applications. In Proceedings of Beowulf '14, Annapolis, MD, USA, 2014. ACM.
[46] Geoffrey C. Fox, Shantenu Jha, Judy Qiu, and Andre Luckow. A systematic approach to big data benchmarks. In Lucio Grandinetti, Gerhard Joubert, Marcel Kunze, and Valerio Pascucci, editors, Big Data and High Performance Computing, volume 24, pages 47–66. IOS Press, München, 2015. http://dx.doi.org/10.3233/978-1-61499-583-8-47.
[47] NIST Big Data Working Group. http://bigdatawg.nist.gov/usecases.php, 2019.
[48] Andre Luckow, Shantenu Jha, Joohyun Kim, Andre Merzky, and Bettina Schnor. Adaptive replica-exchange simulations. Royal Society Philosophical Transactions A, pages 2595–2606, June 2009.
[49] David W. Wright, Benjamin A. Hall, Owain A. Kenway, Shantenu Jha, and Peter V. Coveney. Computing clinically relevant binding free energies of HIV-1 protease inhibitors. Journal of Chemical Theory and Computation, 10(3):1228–1241, 2014. PMID: 24683369.
[50] Yaakoub El-Khamra and Shantenu Jha. Developing autonomic distributed scientific applications: A case study from history matching using ensemble Kalman filters. In Proceedings of the 6th International Conference Industry Session on Grids Meets Autonomic Computing, GMAC '09, pages 19–28, New York, NY, USA, 2009. ACM.
[51] Sharath Maddineni, Joohyun Kim, Yaakoub El-Khamra, and Shantenu Jha. Distributed Application Runtime Environment (DARE): A standards-based middleware framework for science gateways. Journal of Grid Computing, 10(4):647–664, 2012.
[52] G. C. Fox, J. Qiu, S. Kamburugamuve, S. Jha, and A. Luckow. HPC-ABDS high performance computing enhanced Apache big data stack. Pages 1057–1066, May 2015.
[53] Ioannis Paraskevakos, Andre Luckow, Mahzad Khoshlessan, George Chantzialexiou, Thomas E. Cheatham, Oliver Beckstein, Geoffrey C. Fox, and Shantenu Jha. Task-parallel analysis of molecular dynamics trajectories. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, New York, NY, USA, 2018. ACM.
[54] Pradeep Kumar Mantha, Andre Luckow, and Shantenu Jha. Pilot-MapReduce: An extensible and flexible MapReduce implementation for distributed data. In Proceedings of the Third International Workshop on MapReduce and its Applications, MapReduce '12, pages 17–24, New York, NY, USA, 2012. ACM.
[55] Shantenu Jha, Judy Qiu, André Luckow, Pradeep Kumar Mantha, and Geoffrey Charles Fox. A tale of two data-intensive paradigms: Applications, abstractions, and architectures. Proceedings of the 3rd IEEE International Congress of Big Data, abs/1403.1528, 2014.
[56] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[57] William Robert Sutherland. The on-line graphical specification of computer procedures. PhD thesis, MIT, 1966.
[58] D. C. DiNucci and R. G. Babb. Design and implementation of parallel programs with LGDF2. In Digest of Papers, COMPCON Spring 89, Thirty-Fourth IEEE Computer Society International Conference: Intellectual Leverage, pages 102–107, February 1989.
[59] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41(3):59–72, 2007.
[60] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 810–818, New York, NY, USA, 2010. ACM.
[61] Geoffrey Fox, Shantenu Jha, and Lavanya Ramakrishnan. Stream 2015 final report. http://streamingsystems.org/finalreport.pdf, 2015.
[62] Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gennady Pekhimenko. Priority-based parameter propagation for distributed DNN training. In Proceedings of SysML, May 2019.
[63] Andre Luckow, Lukas Lacinski, and Shantenu Jha. SAGA BigJob: An extensible and interoperable pilot-job abstraction for distributed applications and systems. In The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 135–144, 2010.
[64] James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, and Steven Tuecke. Condor-G: A computation management agent for multi-institutional grids. Cluster Computing, 5(3):237–246, July 2002.
[65] Matteo Turilli, Mark Santcroos, and Shantenu Jha. A comprehensive perspective on pilot-job systems. ACM Comput. Surv., 51(2):43:1–43:32, April 2018.
[66] Andre Luckow, Mark Santcroos, Ashley Zebrowski, and Shantenu Jha. Pilot-Data: An abstraction for distributed data. Journal of Parallel and Distributed Computing, 2014.
[67] Andre Luckow, Pradeep Kumar Mantha, and Shantenu Jha. Pilot-abstraction: A valid abstraction for data-intensive applications on HPC, Hadoop and cloud infrastructures? CoRR, abs/1501.05041, 2015.
[68] A. Luckow, I. Paraskevakos, G. Chantzialexiou, and S. Jha. Hadoop on HPC: Integrating Hadoop and pilot-based dynamic resource management. IEEE International Workshop on High-Performance Big Data Computing, in conjunction with the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), 2016.
[69] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1st edition, 1994.
[70] Andre Merzky, Ole Weidner, and Shantenu Jha. SAGA: A standardized access layer to heterogeneous distributed computing infrastructure. SoftwareX.
[72] Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 369(1949):3318–3335, 2011.
[73] Andre Luckow and Shantenu Jha. Performance characterization and modeling of serverless and HPC streaming applications. In Proceedings of the StreamML Workshop at the IEEE International Conference on Big Data (IEEE BigData 2019), 2019.
[74] Andre Merzky, Matteo Turilli, Manuel Maldonado, Mark Santcroos, and Shantenu Jha. Using pilot systems to execute many task workloads on supercomputers. In Job Scheduling Strategies for Parallel Processing, pages 61–82, Cham, 2019. Springer International Publishing.
[75] M. Hauck. Automated Experiments for Deriving Performance-relevant Properties of Software Execution Environments. The Karlsruhe Series on Software Design and Quality. KIT Scientific Publishing, 2014.
[76] Jan Waller. Performance benchmarking of application monitoring frameworks. https://macau.uni-kiel.de/receive/diss mods 00016245, 2015.
[77] Jack A. Smith, Melissa Romanus, Pradeep Kumar Mantha, Yaakoub El Khamra, Thomas C. Bishop, and Shantenu Jha. Scalable online comparative genomics of mononucleosomes: A BigJob. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, XSEDE '13, New York, NY, USA, 2013. ACM.
[78] Matteo Turilli, Vivek Balasubramanian, Andre Merzky, Ioannis Paraskevakos, and Shantenu Jha. Middleware building blocks for workflow systems. Computing in Science & Engineering (CiSE), special issue on Incorporating Scientific Workflows in Computing Research Processes. 10.1109/MCSE.2019.2920048 (2019).
[79] Shantenu Jha, Daniel S. Katz, Andre Luckow, Andre Merzky, and Katerina Stamou. Understanding Scientific Applications for Cloud Environments, pages 345–371. John Wiley and Sons, January 2011.
[80] Joel Spolsky. The law of leaky abstractions.