
Publication


Featured research published by Doug Benjamin.


Journal of Physics: Conference Series | 2012

Scientific Cluster Deployment and Recovery – Using puppet to simplify cluster management

Val Hendrix; Doug Benjamin; Yushu Yao

Deployment, maintenance and recovery of a scientific cluster, which has complex, specialized services, can be a time-consuming task requiring the assistance of Linux system administrators, network engineers as well as domain experts. Universities and small institutions that have a part-time FTE with limited time for, and knowledge of, the administration of such clusters can be strained by such maintenance tasks. This work is the result of an effort to maintain a data analysis cluster (DAC) with minimal effort by a local system administrator. The realized benefit is that the scientist, who is also the local system administrator, is able to focus on the data analysis instead of the intricacies of managing a cluster. Our work provides a cluster deployment and recovery process (CDRP) based on the Puppet configuration engine, allowing a part-time FTE to easily deploy and recover entire clusters with minimal effort. Puppet is a configuration management system (CMS) used widely in computing centers for the automatic management of resources. Domain experts use Puppet's declarative language to define reusable modules for service configuration and deployment. Our CDRP has three actors: domain experts, a cluster designer and a cluster manager. The domain experts first write the Puppet modules for the cluster services. A cluster designer then defines a cluster; this includes the creation of cluster roles, mapping the services to those roles and determining the relationships between the services. Finally, a cluster manager acquires the resources (machines, networking), enters the cluster input parameters (hostnames, IP addresses) and automatically generates the deployment scripts used by Puppet to configure each machine to act in its designated role. In the event of a machine failure, the originally generated deployment scripts, along with Puppet, can be used to easily reconfigure a new machine. The cluster definition produced in our CDRP is an integral part of automating cluster deployment in a cloud environment, and our future cloud efforts will build on this work.
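
The role-to-service mapping at the core of the CDRP can be illustrated with a minimal sketch. The Python snippet below is a hypothetical illustration only: the class names (ClusterRole, ClusterDefinition), the render_site_manifest helper and the example services are assumptions, not code from the paper; it simply shows how a cluster definition could map services to roles and emit per-host node declarations for Puppet to consume.

```python
# Hypothetical sketch of the CDRP idea: a cluster definition maps services to
# roles, and per-host node declarations are generated for Puppet to consume.
# Class and service names here are illustrative, not the paper's actual code.
from dataclasses import dataclass, field


@dataclass
class ClusterRole:
    name: str
    services: list = field(default_factory=list)  # Puppet modules written by domain experts


@dataclass
class ClusterDefinition:
    roles: dict = field(default_factory=dict)        # role name -> ClusterRole
    assignments: dict = field(default_factory=dict)  # hostname -> role name

    def render_site_manifest(self) -> str:
        """Emit a Puppet site.pp-style node block for every assigned host."""
        blocks = []
        for host, role_name in self.assignments.items():
            role = self.roles[role_name]
            includes = "\n".join(f"  include {svc}" for svc in role.services)
            blocks.append(f"node '{host}' {{\n{includes}\n}}")
        return "\n\n".join(blocks)


# Example: a head node and two worker nodes for a data analysis cluster.
cluster = ClusterDefinition(
    roles={
        "head": ClusterRole("head", ["nfs::server", "condor::master"]),
        "worker": ClusterRole("worker", ["nfs::client", "condor::worker"]),
    },
    assignments={"dac-head01": "head", "dac-wn01": "worker", "dac-wn02": "worker"},
)
print(cluster.render_site_manifest())
```

In the real CDRP the generated deployment scripts would also carry the site parameters (hostnames, IP addresses) entered by the cluster manager, which is what makes machine recovery a matter of rerunning the same scripts.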


Journal of Physics: Conference Series | 2010

CDF GlideinWMS usage in Grid computing of high energy physics

Marian Zvada; Doug Benjamin; I. Sfiligoi

Many members of large science collaborations already have specialized grids available to advance their research, but the need for more computing resources for data analysis has forced the Collider Detector at Fermilab (CDF) collaboration to move beyond the usage of dedicated resources and start exploiting Grid resources. Nowadays, the CDF experiment relies increasingly on glidein-based computing pools for data reconstruction, Monte Carlo production and user data analysis, serving over 400 users through the Central Analysis Farm (CAF) middleware on top of the Condor batch system and the CDF Grid infrastructure. Condor has a distributed architecture, and its glidein mechanism of pilot jobs is ideal for abstracting Grid computing by forming a virtual private computing pool. We present the first production use of the generic pilot-based Workload Management System (glideinWMS), an implementation of the pilot mechanism built on the Condor distributed infrastructure. CDF Grid computing uses glideinWMS for data reconstruction on the FNAL campus Grid and for user analysis and Monte Carlo production across the Open Science Grid (OSG). We review this computing model and setup, including the CDF-specific configuration within the glideinWMS system, which provides powerful scalability and makes Grid computing behave like a local batch environment, with the ability to handle more than 10000 running jobs at a time.
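
The pilot-job mechanism at the heart of glideinWMS can be sketched in a few lines: pilots are submitted to Grid sites, and once running they join a central pool and pull user jobs, so heterogeneous Grid resources appear as one private batch pool. The following Python toy model is an assumption-laden sketch (a shared in-process queue stands in for the Condor pool; the function names are invented) and is not glideinWMS code.

```python
# Minimal illustration of the pilot-job pattern used by glideinWMS: pilots
# start on Grid worker nodes, join a central pool, and pull user jobs until
# the queue is drained. This is a toy model, not the glideinWMS implementation.
import queue
import threading

user_jobs = queue.Queue()
for i in range(8):
    user_jobs.put(f"user-job-{i}")


def pilot(pilot_id: int) -> None:
    """A pilot validates its worker node, then pulls user jobs from the pool."""
    # In reality the pilot starts a Condor daemon that joins the central pool;
    # here we simply consume from a shared queue.
    while True:
        try:
            job = user_jobs.get_nowait()
        except queue.Empty:
            return
        print(f"pilot {pilot_id} running {job}")
        user_jobs.task_done()


# "Submit" three pilots, e.g. one per Grid site slot.
pilots = [threading.Thread(target=pilot, args=(i,)) for i in range(3)]
for t in pilots:
    t.start()
for t in pilots:
    t.join()
```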


Journal of Physics: Conference Series | 2015

Simulation of LHC events on a million threads

J. T. Childers; Thomas D. Uram; Thomas LeCompte; Michael E. Papka; Doug Benjamin

Demand for Grid resources is expected to double during LHC Run II as compared to Run I; the capacity of the Grid, however, will not double. The HEP community must therefore consider how to bridge this computing gap by targeting larger compute resources and using the available compute resources as efficiently as possible. Argonne's Mira, the fifth-fastest supercomputer in the world, can run roughly five times the number of parallel processes that the ATLAS experiment typically uses on the Grid. We ported Alpgen, a serial x86 code, to run as a parallel application under MPI on the Blue Gene/Q architecture. By analyzing the Alpgen code, we reduced its memory footprint to allow running 64 threads per node, utilizing the four hardware threads available per core on the PowerPC A2 processor. Event generation and unweighting, typically run as independent serial phases, are coupled together in a single job in this scenario, reducing intermediate writes to the filesystem. With these optimizations, we have successfully run LHC proton-proton physics event generation at the scale of a million threads, filling two-thirds of Mira.
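
The coupling of event generation and unweighting in a single parallel job can be sketched with mpi4py. Everything specific below is an assumption: the uniform "generator weight", the fixed maximum weight and the per-rank output files are placeholders standing in for Alpgen's actual generation, unweighting and I/O.

```python
# Sketch of coupling event generation and unweighting in one MPI job, so that
# only accepted events are written and intermediate files are avoided.
# Uses mpi4py; the "generator" and weights here are placeholders, not Alpgen.
import random

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

EVENTS_PER_RANK = 1000
MAX_WEIGHT = 1.0  # assumed known maximum weight used for hit-or-miss unweighting

random.seed(rank)  # independent stream per rank (a real run needs a proper parallel RNG scheme)

accepted = []
for _ in range(EVENTS_PER_RANK):
    event_weight = random.random()           # stand-in for the generator's event weight
    if random.random() * MAX_WEIGHT < event_weight:
        accepted.append(event_weight)         # keep the accepted event in memory

# Each rank writes only its accepted events, e.g. one file per rank,
# instead of writing all weighted events and unweighting in a second pass.
with open(f"events_rank{rank:06d}.txt", "w") as out:
    out.write("\n".join(f"{w:.6f}" for w in accepted))

total = comm.reduce(len(accepted), op=MPI.SUM, root=0)
if rank == 0:
    print(f"accepted {total} events across {comm.Get_size()} ranks")
```

The point of the pattern is that each rank keeps only its accepted events in memory and writes once, avoiding the intermediate weighted-event files of the usual two-phase workflow.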


Computer Physics Communications | 2017

Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

J. T. Childers; Thomas D. Uram; Thomas LeCompte; Michael E. Papka; Doug Benjamin

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event-generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application used by LHC experiments to simulate collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial application to a large-scale parallel application and the performance that was achieved.


Journal of Physics: Conference Series | 2015

Achieving production-level use of HEP software at the Argonne Leadership Computing Facility

Thomas D. Uram; J. T. Childers; Thomas LeCompte; Michael E. Papka; Doug Benjamin

HEP's demand for computing resources has grown beyond the capacity of the Grid, and these demands will accelerate with the higher energy and luminosity planned for Run II. Mira, the ten-petaFLOPS supercomputer at the Argonne Leadership Computing Facility, is a potentially significant compute resource for HEP research. Through an award of fifty million hours on Mira, we have delivered millions of events to LHC experiments by establishing the means of marshaling jobs through serial stages on local clusters and parallel stages on Mira. We are running several HEP applications, including Alpgen, Pythia, Sherpa, and Geant4. Event generators, such as Sherpa, typically have a split workload: a small-scale integration phase and a second, more scalable, event-generation phase. To accommodate this workload on Mira we have developed two Python-based Django applications, Balsam and ARGO. Balsam is a generalized scheduler interface that uses a plugin system for interacting with scheduler software such as HTCondor, Cobalt, and TORQUE. ARGO is a workflow manager that submits jobs to instances of Balsam. Through these mechanisms, the serial and parallel tasks within jobs are executed on the appropriate resources. This approach and its integration with the PanDA production system are discussed.
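
The plugin idea behind Balsam can be illustrated with a small sketch: one abstract scheduler interface, one concrete class per batch system, so the workflow layer (ARGO in the paper) never needs to know which scheduler it is talking to. The Python below is a hypothetical sketch under that assumption; the class and method names are not the actual Balsam API, and the submit commands are only printed.

```python
# Hypothetical sketch of a plugin-style scheduler interface in the spirit of
# Balsam: one abstract interface, one concrete plugin per batch system.
# Class and method names are illustrative, not the actual Balsam API.
from abc import ABC, abstractmethod


class SchedulerPlugin(ABC):
    """Uniform interface that a workflow manager (e.g. ARGO) can call."""

    @abstractmethod
    def submit(self, script_path: str, nodes: int, walltime_minutes: int) -> str:
        """Submit a job script and return the scheduler's job id."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return a normalized state such as 'queued', 'running' or 'done'."""


class CobaltPlugin(SchedulerPlugin):
    def submit(self, script_path, nodes, walltime_minutes):
        # A real plugin would shell out to the Cobalt qsub with its specific flags.
        print(f"cobalt qsub -n {nodes} -t {walltime_minutes} {script_path}")
        return "cobalt-12345"

    def status(self, job_id):
        return "queued"


class CondorPlugin(SchedulerPlugin):
    def submit(self, script_path, nodes, walltime_minutes):
        print(f"condor_submit {script_path}")
        return "condor-67890"

    def status(self, job_id):
        return "running"


def run_step(plugin: SchedulerPlugin, script: str) -> str:
    """The caller is independent of which batch system sits underneath."""
    job_id = plugin.submit(script, nodes=128, walltime_minutes=60)
    return plugin.status(job_id)


print(run_step(CobaltPlugin(), "generate_events.sh"))
print(run_step(CondorPlugin(), "integrate_phase.sh"))
```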


Journal of Physics: Conference Series | 2012

Eurogrid: a new glideinWMS based portal for CDF data analysis

S Amerio; Doug Benjamin; J Dost; G Compostella; D. Lucchesi; I. Sfiligoi

The CDF experiment at Fermilab ended its Run-II phase in September 2011 after 11 years of operations and 10 fb−1 of collected data. The CDF computing model is based on a Central Analysis Farm (CAF) consisting of local computing and storage resources, supported by OSG and LCG resources accessed through dedicated portals. At the beginning of 2011 a new portal, Eurogrid, was developed to effectively exploit computing and disk resources in Europe: a dedicated farm and storage area at the TIER-1 CNAF computing center in Italy, and additional LCG computing resources at different TIER-2 sites in Italy, Spain, Germany and France, are accessed through a common interface. The goal of this project is to develop a portal that is easy to integrate into the existing CDF computing model, completely transparent to the user and requiring a minimum amount of maintenance support from the CDF collaboration. In this paper we review the implementation of this new portal and its performance in the first months of usage. Eurogrid is based on the glideinWMS software, a glidein-based Workload Management System (WMS) that works on top of Condor. As the CDF CAF is based on Condor, the choice of glideinWMS was natural and the implementation seamless. Thanks to the pilot jobs, user-specific requirements and site resources are matched in a very efficient way, completely transparent to the users. Official since June 2011, Eurogrid effectively complements and supports CDF computing resources, offering an optimal solution for the future in terms of the manpower required for administration, support and development.


Journal of Physics: Conference Series | 2010

A new CDF model for data movement based on SRM

Manoj Kumar Jha; Gabriele Compostella; D. Lucchesi; Simone Pagan Griso; Doug Benjamin

Being a large international collaboration established well before the full development of the Grid as the main computing tool for High Energy Physics, CDF has recently changed and improved its computing model, decentralizing parts of it in order to exploit the rising number of distributed resources available nowadays. Despite those efforts, while the large majority of CDF Monte Carlo production has moved to the Grid, data processing is still mainly performed in dedicated farms hosted at FNAL, requiring centralized management of the data and Monte Carlo samples needed for physics analysis. This raises the question of how to manage the transfer of produced Monte Carlo samples from remote Grid sites to FNAL in an efficient way; up to now CDF has relied on a non-scalable centralized solution based on dedicated data servers accessed through the rcp protocol, which has proven to be unsatisfactory. A new data transfer model has been designed that uses SRMs as local caches for remote Monte Carlo production sites, interfaces them with SAM, the experiment's data catalog, and finally realizes the file movement by exploiting the features provided by the data catalog's transfer layer. We describe here the model and its integration within the current CDF computing architecture.
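
The proposed transfer chain can be outlined as three steps: stage the produced file into the site's SRM cache, declare it to the SAM catalog, and let SAM's transfer layer move it to FNAL. The Python outline below is purely illustrative; every function name, URL and command string is a hypothetical placeholder, not the experiment's tooling.

```python
# Hypothetical outline of the SRM-based transfer model: files produced at a
# remote site are staged into the site's SRM cache, declared to the SAM data
# catalog, and then moved to FNAL by the catalog's transfer layer.
# All names below are illustrative placeholders.

def stage_to_local_srm(local_path: str, srm_cache_url: str) -> str:
    """Copy a produced Monte Carlo file into the site's SRM storage element."""
    remote_url = f"{srm_cache_url}/{local_path.rsplit('/', 1)[-1]}"
    print(f"copy file://{local_path} -> {remote_url}")  # placeholder for a real SRM client call
    return remote_url


def declare_to_sam(file_url: str, metadata: dict) -> None:
    """Register the file and its metadata in the SAM catalog."""
    print(f"declare {file_url} with metadata {metadata}")


def request_transfer_to_fnal(file_url: str) -> None:
    """Let the catalog's transfer layer schedule the movement to FNAL storage."""
    print(f"schedule transfer of {file_url} to FNAL")


url = stage_to_local_srm("/scratch/mc/prod_run42.root", "srm://se.site.example/cdf/cache")
declare_to_sam(url, {"dataset": "mc_prod_run42", "events": 50000})
request_transfer_to_fnal(url)
```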


Proceedings of the Fourth International Workshop on HPC User Support Tools | 2017

An Edge Service for Managing HPC Workflows

J. Taylor Childers; Thomas D. Uram; Doug Benjamin; Thomas LeCompte; Michael E. Papka

Large experimental collaborations, such as those at the Large Hadron Collider at CERN, have developed large job management systems running hundreds of thousands of jobs across worldwide computing grids. HPC facilities are becoming more important to these data-intensive workflows, and integrating them into the experiment job management system is non-trivial due to increased security and heterogeneous computing environments. This article describes a common edge service developed and deployed on DOE supercomputers for both small users and large collaborations. This edge service provides a uniform interaction across many different supercomputers. Example users are described along with the related performance.
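
As a rough illustration of the "uniform interaction" idea, the sketch below shows an edge-service object that accepts the same job request regardless of which site-specific submission routine sits behind it. All names (EdgeService, JobRequest, mira_submit) are assumptions made for illustration and do not come from the paper.

```python
# Hypothetical sketch of an edge service's uniform job interface: an external
# workload manager talks to the same small API no matter which HPC system
# sits behind it. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class JobRequest:
    executable: str
    nodes: int
    walltime_minutes: int


class EdgeService:
    """Runs at the facility boundary; translates uniform requests into local submissions."""

    def __init__(self, site_name: str, local_submit):
        self.site_name = site_name
        self.local_submit = local_submit  # site-specific submission callable

    def submit(self, request: JobRequest) -> str:
        # Security handling, queue selection and data staging would live here,
        # hidden from the external workflow system.
        return self.local_submit(request)


def mira_submit(req: JobRequest) -> str:
    # Placeholder for a real site-specific submission path.
    return f"mira:{req.executable}:{req.nodes}nodes:{req.walltime_minutes}min"


edge = EdgeService("ALCF", mira_submit)
print(edge.submit(JobRequest("alpgen.sh", nodes=1024, walltime_minutes=120)))
```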


Journal of Physics: Conference Series | 2014

ATLAS Experience with HEP Software at the Argonne Leadership Computing Facility

Thomas D. Uram; Thomas LeCompte; Doug Benjamin

A number of HEP software packages used by the ATLAS experiment, including GEANT4, ROOT and ALPGEN, have been adapted to run on the IBM Blue Gene supercomputers at the Argonne Leadership Computing Facility. These computers use a non-x86 architecture and have a considerably less rich operating environment than is in common use in HEP, but they also represent a computing capacity an order of magnitude beyond what ATLAS is presently using via the LCG. The status and potential for making use of leadership-class computing, including the status of integration with the ATLAS production system, are discussed.


Journal of Physics: Conference Series | 2012

Software installation and condition data distribution via CernVM File System in ATLAS

A. de Salvo; A. De Silva; Doug Benjamin; J. Blomer; P Buncic; A Harutyunyan; A. Undrus; Y Yao

Collaboration


Dive into Doug Benjamin's collaborations.

Top Co-Authors

Thomas D. Uram, Argonne National Laboratory
Thomas LeCompte, Argonne National Laboratory
Michael E. Papka, Northern Illinois University
J. T. Childers, Argonne National Laboratory
Val Hendrix, Lawrence Berkeley National Laboratory
I. Sfiligoi, University of California
S. Panitkin, Brookhaven National Laboratory
Ian Gable, University of Victoria