
Publications


Featured research published by Mark Santcroos.


International Conference on e-Science | 2012

P∗: A model of pilot-abstractions

Andre Luckow; Mark Santcroos; Andre Merzky; Ole Weidner; Pradeep Kumar Mantha; Shantenu Jha

Pilot-Jobs support effective distributed resource utilization, and are arguably one of the most widely used distributed computing abstractions - as measured by the number and types of applications that use them, as well as the number of production distributed cyberinfrastructures that support them. In spite of broad uptake, there does not exist a well-defined, unifying conceptual model of Pilot-Jobs which can be used to define, compare and contrast different implementations. Often Pilot-Job implementations are strongly coupled to the distributed cyberinfrastructure they were originally designed for. These factors present a barrier to extensibility and interoperability. This paper is an attempt to (i) provide a minimal but complete model (P*) of Pilot-Jobs, (ii) establish the generality of the P* Model by mapping various existing and well-known Pilot-Job frameworks such as Condor and DIANE to P*, (iii) derive an interoperable and extensible API for the P* Model (Pilot-API), (iv) validate the implementation of the Pilot-API by concurrently using multiple distinct Pilot-Job frameworks on distinct production distributed cyberinfrastructures, and (v) apply the P* Model to Pilot-Data.
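The Pilot-Job abstraction the paper models can be sketched in a few lines: a pilot is acquired once through the batch queue, and application tasks are then scheduled directly into the already-held allocation, avoiding repeated queue waits. The sketch below is illustrative only; the class and method names are assumptions, not the actual Pilot-API.

```python
# Minimal sketch of the Pilot-Job pattern: a "pilot" is submitted once
# through the resource manager's queue; application tasks are then
# scheduled into the pilot directly, bypassing further queue waits.
# All names here are illustrative, not the actual Pilot-API.
from concurrent.futures import ThreadPoolExecutor

class Pilot:
    """A placeholder resource acquired once via the batch queue."""
    def __init__(self, cores):
        self.cores = cores
        self._pool = ThreadPoolExecutor(max_workers=cores)

    def submit_task(self, fn, *args):
        # Tasks run inside the already-acquired allocation.
        return self._pool.submit(fn, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)

def analyze(chunk):
    return sum(chunk)  # stand-in for a real compute task

pilot = Pilot(cores=4)
futures = [pilot.submit_task(analyze, [i, i + 1]) for i in range(8)]
results = [f.result() for f in futures]
pilot.shutdown()
print(results)  # → [1, 3, 5, 7, 9, 11, 13, 15]
```

The key property is that the eight tasks share one resource acquisition; in a real Pilot-Job system the pool would be a multi-node allocation rather than a thread pool.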


Grid Computing | 2012

A Grid-Enabled Gateway for Biomedical Data Analysis

Shayan Shahand; Mark Santcroos; Antoine H. C. van Kampen; Sílvia Delgado Olabarriaga

Biomedical researchers can leverage Grid computing technology to address their increasing demands for data- and compute-intensive data analysis. However, usage of existing Grid infrastructures remains difficult for them. The e-infrastructure for biomedical science (e-BioInfra) is a platform with services that shield middleware complexities, in particular workflow management and monitoring. These services can be invoked from a web-based interface, called e-BioInfra Gateway, to perform large-scale data analysis experiments, such that biomedical researchers can focus on their own research problems. The gateway was designed to simplify usage by both biomedical researchers and e-BioInfra administrators, and to support straightforward extension with new data analysis methods. In this paper we present the architecture and implementation of the gateway, along with statistics on its usage. We also share lessons learned during the gateway's development and operation. The gateway is currently used in several biomedical research projects and in teaching medical students the principles of data analysis.


High Performance Distributed Computing | 2012

Towards a common model for pilot-jobs

Andre Luckow; Mark Santcroos; Ole Weidner; Andre Merzky; Sharath Maddineni; Shantenu Jha

Pilot-Jobs have become one of the most successful abstractions in distributed computing. In spite of extensive uptake, there does not exist a well-defined, unifying conceptual model of pilot-jobs which can be used to define, compare and contrast different implementations. This presents a barrier to extensibility and interoperability. This paper is an attempt to (i) provide a minimal but complete model (P*) of pilot-jobs, (ii) establish the generality of the P* Model by mapping various existing and well-known pilot-job frameworks such as Condor and DIANE to P*, and (iii) demonstrate the interoperable and concurrent usage of distinct pilot-job frameworks on different production distributed cyberinfrastructures via the use of an extensible API for the P* Model (Pilot-API).


IEEE International Conference on e-Science | 2011

A Provenance Approach to Trace Scientific Experiments on a Grid Infrastructure

Ammar Benabdelkader; Mark Santcroos; Souley Madougou; Antoine H. C. van Kampen; Sílvia Delgado Olabarriaga

Large experiments on distributed infrastructures are becoming increasingly complex to manage, in particular tracing all computations that gave rise to a piece of data or an event such as an error. The work presented in this paper describes the design and implementation of an architecture to support experiment provenance, and its deployment in the concrete case of a particular e-infrastructure for the biosciences. The proposed solution consists of: (a) a data provenance repository to capture scientific experiments and their execution path, (b) a software tool (crawler) that gathers, classifies, links, and stores the information collected from various sources, and (c) a set of user interfaces through which the end user can access the provenance data, interpret the results, and trace the sources of failure. The approach is based on an OPM-compliant API, PLIER, which is flexible enough to support future extensions and facilitates interoperability among heterogeneous application systems.


Grid Computing | 2013

Characterizing workflow-based activity on a production e-infrastructure using provenance data

Souley Madougou; Shayan Shahand; Mark Santcroos; Barbera D. C. van Schaik; Ammar Benabdelkader; Antoine H. C. van Kampen; Sílvia Delgado Olabarriaga

Grid computing and workflow management systems emerged as solutions to the challenges arising from the processing and storage of the sheer volumes of data generated by modern simulations and data acquisition devices. Workflow management systems usually document the process of workflow execution either as structured provenance information or as log files. Provenance is recognized as an important feature of workflow management systems; however, there are still few reports on its usage in practical cases. In this paper we present the provenance system implemented in our platform, and then use the information captured by this system during 8 months of platform operation to analyze platform usage and to perform multilevel error pattern analysis. We exploit this large amount of structured data, using the explanatory potential of statistical approaches, to find properties of workflows, jobs and resources that are related to workflow failure. Such an analysis enables us to characterize workflow executions on the infrastructure and understand workflow failures. The approach is generic and applicable to other e-infrastructures to gain insight into operational incidents.


Studies in Health Technology and Informatics | 2012

Provenance for distributed biomedical workflow execution

Souley Madougou; Mark Santcroos; Ammar Benabdelkader; Barbera D. C. van Schaik; Shayan Shahand; Vladimir Korkhov; Antoine H. C. van Kampen; Sílvia Delgado Olabarriaga

Scientific research has become very data- and compute-intensive because of the progress in data acquisition and measurement devices, which is particularly true in the Life Sciences. To cope with this deluge of data, scientists use distributed computing and storage infrastructures. The use of such infrastructures in turn introduces new challenges for scientists in terms of proper and efficient use. Scientific workflow management systems play an important role in facilitating the use of the infrastructure by hiding some of its complexity. Although most scientific workflow management systems are provenance-aware, not all of them come with provenance functionality out of the box. In this paper we describe the improvement and integration of a provenance system into an e-infrastructure for biomedical research based on the MOTEUR workflow management system. The main contributions of the paper are: presenting an OPM implementation that uses a relational database backend for the provenance store; providing an e-infrastructure with a comprehensive provenance system; defining a generic approach to provenance implementation, potentially suitable for other workflow systems and application domains; and demonstrating the value of this system with use cases that present the provenance data through a user-friendly web interface.
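An OPM-style provenance store on a relational backend, as the paper describes, can be sketched minimally: artifacts and processes are nodes, while "used" and "wasGeneratedBy" are the dependency edges that make execution paths traceable. The schema and file names below are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of an OPM-style provenance store on a relational backend.
# Artifacts and processes are nodes; "used" and "wasGeneratedBy" are edges.
# Schema and example names are illustrative, not the paper's actual design.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE artifact (id TEXT PRIMARY KEY, value TEXT);
CREATE TABLE process  (id TEXT PRIMARY KEY, name  TEXT);
CREATE TABLE used           (process TEXT, artifact TEXT, role TEXT);
CREATE TABLE wasGeneratedBy (artifact TEXT, process TEXT, role TEXT);
""")

# Record one workflow step: align.py read input.fastq and wrote output.bam
con.execute("INSERT INTO artifact VALUES ('a1', 'input.fastq')")
con.execute("INSERT INTO artifact VALUES ('a2', 'output.bam')")
con.execute("INSERT INTO process  VALUES ('p1', 'align.py')")
con.execute("INSERT INTO used           VALUES ('p1', 'a1', 'in')")
con.execute("INSERT INTO wasGeneratedBy VALUES ('a2', 'p1', 'out')")

# Trace an output back to its sources: walk wasGeneratedBy -> used
rows = con.execute("""
SELECT src.value FROM wasGeneratedBy g
JOIN used u       ON u.process = g.process
JOIN artifact src ON src.id = u.artifact
WHERE g.artifact = 'a2'
""").fetchall()
print(rows)  # → [('input.fastq',)]
```

Tracing a full execution path would apply the same join recursively, following each input artifact back to the process that generated it.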


ACM Computing Surveys | 2018

A Comprehensive Perspective on Pilot-Job Systems

Matteo Turilli; Mark Santcroos; Shantenu Jha

Pilot-Job systems play an important role in supporting distributed scientific computing. They are used to execute millions of jobs on several cyberinfrastructures worldwide, consuming billions of CPU hours a year. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also witnessing adoption beyond traditional domains. Notwithstanding the growing impact on scientific research, there is no agreement on a definition of Pilot-Job system and no clear understanding of the underlying abstraction and paradigm. Pilot-Job implementations have proliferated with no shared best practices or open interfaces and little interoperability. Ultimately, this is hindering the realization of the full impact of Pilot-Jobs by limiting their robustness, portability, and maintainability. This article offers a comprehensive analysis of Pilot-Job systems, critically assessing their motivations, evolution, properties, and implementation. The three main contributions of this article are as follows: (1) an analysis of the motivations and evolution of Pilot-Job systems; (2) an outline of the Pilot abstraction, its distinguishing logical components and functionalities, its terminology, and its architecture pattern; and (3) the description of core and auxiliary properties of Pilot-Job systems and the analysis of six exemplar Pilot-Job implementations. Together, these contributions illustrate the Pilot paradigm, its generality, and how it helps to address some challenges in distributed scientific computing.


International Conference on e-Science | 2012

Pilot abstractions for compute, data, and network

Mark Santcroos; Sílvia Delgado Olabarriaga; Daniel S. Katz; Shantenu Jha

Scientific experiments in a variety of domains are producing increasing amounts of data that need to be processed efficiently. Distributed Computing Infrastructures are increasingly important in fulfilling these large-scale computational requirements.


Proceedings of EGI Community Forum 2012 / EMI Second Technical Conference — PoS(EGICF12-EMITC2) | 2012

Challenges in DNA sequence analysis on a production grid

Barbera D. C. van Schaik; Mark Santcroos; Vladimir Korkhov; Aldo Jongejan; Marcel Willemsen; Antoine H. C. van Kampen; Sílvia Delgado Olabarriaga

Modern DNA sequencing machines produce data in the range of 1-100 GB per experiment, and with ongoing technological developments this amount is rapidly increasing. The majority of experiments involve re-sequencing of human genomes and exomes to find genomic regions that are associated with disease. There are many sequence analysis tools freely available, e.g. for sequence alignment, quality control and variant detection, and new tools are frequently developed to address new biological questions. Since 2008 we have used workflow technology to allow easy incorporation of such software in our data analysis pipelines, as well as to leverage grid infrastructures for the analysis of large datasets in parallel. The size of the datasets has grown from 1 GB to 70 GB in 3 years, so adjustments were needed to optimize these workflows. Procedures have been implemented for faster data transfer to and from grid resources, and for fault recovery at run time. A split-and-merge procedure for a frequently used sequence alignment tool, BWA, resulted in a threefold reduction of the total time needed to complete an experiment and increased efficiency by reducing the number of failures. The success rate was increased from 10% to 70%. In addition, resubmission of partly failed workflows was automated, which reduced user intervention. Here we present our current procedure for analyzing data from DNA sequencing experiments, comment on our experiences, and focus on the improvements needed to scale up the analysis of genomics data at our hospital.
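The split-and-merge procedure described above has a simple shape: split the read set into chunks, align each chunk independently (in the real pipeline, a BWA job per grid node), and merge the partial outputs in order. The sketch below uses a toy stand-in for alignment; all function names are illustrative, not the authors' actual pipeline code.

```python
# Toy sketch of the split-and-merge procedure: split reads into chunks,
# "align" each chunk in parallel (a stand-in for one BWA job per grid
# node), and merge the per-chunk outputs in order. Names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def split(reads, n_chunks):
    """Divide the read set into roughly equal chunks."""
    k, r = divmod(len(reads), n_chunks)
    out, start = [], 0
    for i in range(n_chunks):
        end = start + k + (1 if i < r else 0)
        out.append(reads[start:end])
        start = end
    return out

def align(chunk):
    """Stand-in for one BWA alignment job; uppercasing mimics 'output'."""
    return [read.upper() for read in chunk]

def merge(parts):
    """Concatenate per-chunk alignment outputs, preserving read order."""
    return [rec for part in parts for rec in part]

reads = ["acgt", "ttag", "ggca", "catt", "tgga"]
with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(align, split(reads, 3)))
print(merge(parts))  # → ['ACGT', 'TTAG', 'GGCA', 'CATT', 'TGGA']
```

Besides the speedup from parallelism, smaller chunks mean a failed chunk can be retried cheaply, which matches the reduced failure count reported above.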


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

Exploring Dynamic Enactment of Scientific Workflows Using Pilot-Abstractions

Mark Santcroos; Barbera D. C. van Schaik; Shayan Shahand; Sílvia Delgado Olabarriaga; Andre Luckow; Shantenu Jha

Current workflow abstractions in general lack: (a) an adequate approach to handling distributed data, and (b) a proper separation between logical tasks and data-flow and their mapping onto physical locations. As the complexity and dynamism of data and processing distribution have increased, optimized mapping of logical tasks to physical resources has become a necessity to avoid bottlenecks. We argue that the management of dynamic data and compute should become part of the runtime system of workflow engines to enable workflows to scale as necessary to address big data challenges and fully exploit distributed computing infrastructures (DCIs). In this paper we explore how the P* model for pilot-abstractions, which proposes a clear separation between logical compute and data units and their realization as a job or a file on some physical resource, could provide these capabilities for such a runtime environment. The Pilot-API provides a general-purpose interface to pilot-abstractions and the ability to assign compute and data resources to them. We share our experience of re-implementing a DNA sequencing pipeline, as a case study, using the Pilot-API. This first exercise, which resulted in a running application discussed here, illustrates the potential of this API to address (a) and (b). Our initial results indicate that the pilot abstractions (as captured by the P* model) offer an interesting approach to exploring the design of a new generation of workflow management systems and runtime environments capable of intelligently deciding on application-aware late binding to physical resources.
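The late-binding idea at the heart of this abstract can be sketched compactly: compute units are described independently of any physical resource, and the runtime binds each unit to whichever pilot has free capacity only when it is about to run. All class and function names below are hypothetical, not the actual Pilot-API.

```python
# Sketch of late binding: logical compute units are described without
# reference to a physical resource; the runtime assigns each unit to the
# least-loaded pilot at run time. Names are illustrative, not Pilot-API.
from dataclasses import dataclass, field

@dataclass
class ComputeUnit:
    """A logical task description, independent of physical placement."""
    executable: str
    arguments: list = field(default_factory=list)

@dataclass
class PilotResource:
    """A pilot already holding capacity on some physical resource."""
    name: str
    free_cores: int

def bind(units, pilots):
    """Late binding: place each unit on the pilot with most free cores."""
    placement = {}
    for i, unit in enumerate(units):
        target = max(pilots, key=lambda p: p.free_cores)
        target.free_cores -= 1
        placement[i] = target.name
    return placement

units = [ComputeUnit("bwa", ["chunk%d" % i]) for i in range(4)]
pilots = [PilotResource("cluster-a", 3), PilotResource("cluster-b", 1)]
print(bind(units, pilots))
# → {0: 'cluster-a', 1: 'cluster-a', 2: 'cluster-a', 3: 'cluster-b'}
```

Because the unit descriptions never name a resource, the same workflow can run unchanged whether one pilot or many are available, which is the separation between (a) logical tasks and (b) physical mapping that the abstract argues for.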

Collaboration


Dive into Mark Santcroos's collaborations.

Top Co-Author: Andre Merzky (Louisiana State University)