
Publication


Featured research published by Torre Wenaus.


Journal of Physics: Conference Series | 2007

The open science grid

R. Pordes; D. Petravick; Bill Kramer; Doug Olson; Miron Livny; Alain Roy; P. Avery; K. Blackburn; Torre Wenaus; F. Würthwein; Ian T. Foster; Robert Gardner; Michael Wilde; Alan Blatecky; John McGee; Rob Quick

The Open Science Grid (OSG) provides a distributed facility where the Consortium members provide guaranteed and opportunistic access to shared computing and storage resources. OSG provides support for and evolution of the infrastructure through activities that cover operations, security, software, troubleshooting, the addition of new capabilities, support for existing communities, and engagement with new ones. The OSG SciDAC-2 project provides specific activities to manage and evolve the distributed infrastructure and support its use. The innovative aspects of the project are the maintenance and performance of a collaborative (shared & common) petascale national facility over tens of autonomous computing sites, for many hundreds of users, transferring terabytes of data a day, executing tens of thousands of jobs a day, and providing robust and usable resources for scientific groups of all types and sizes. More information can be found at the OSG web site: www.opensciencegrid.org.


Journal of Physics: Conference Series | 2011

Overview of ATLAS PanDA Workload Management

T. Maeno; K. De; Torre Wenaus; P. Nilsson; G. A. Stewart; R Walker; A Stradling; J Caballero; M Potekhin; D Smith

The Production and Distributed Analysis System (PanDA) plays a key role in the ATLAS distributed computing infrastructure. All ATLAS Monte Carlo simulation and data reprocessing jobs pass through the PanDA system. We will describe how PanDA manages job execution on the grid using dynamic resource estimation and data replication together with intelligent brokerage in order to meet the scaling and automation requirements of ATLAS distributed computing. PanDA is also the primary ATLAS system for processing user and group analysis jobs, bringing further requirements for quick, flexible adaptation to the rapidly evolving analysis use cases of the early data-taking phase, in addition to the high reliability, robustness and usability needed to provide efficient and transparent utilization of the grid for analysis users. We will describe how PanDA meets ATLAS requirements, the evolution of the system in light of operational experience, how the system has performed during the first LHC data-taking phase and plans for the future.
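
As an illustration of the brokerage idea described in the abstract, the sketch below scores candidate sites by input-data availability and free capacity and picks the best one. This is not PanDA code; the Site fields and the broker_site function are hypothetical stand-ins for the real resource estimates.

    # Illustrative sketch of data- and resource-aware brokerage (not PanDA source code).
    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        free_slots: int        # estimated free CPU slots (hypothetical metric)
        data_fraction: float   # fraction of the job's input data already on site

    def broker_site(sites, min_slots=1):
        """Pick the site with the most local input data, breaking ties by free slots."""
        candidates = [s for s in sites if s.free_slots >= min_slots]
        if not candidates:
            return None  # no site can take the job now; the job waits to be rebrokered later
        return max(candidates, key=lambda s: (s.data_fraction, s.free_slots))

    sites = [Site("BNL", 500, 0.9), Site("CERN-T0", 50, 1.0), Site("MWT2", 2000, 0.2)]
    best = broker_site(sites)
    print(best.name if best else "defer")   # -> CERN-T0 (all input data already local)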


Journal of Physics: Conference Series | 2011

The ATLAS PanDA Pilot in Operation

P. Nilsson; J Caballero; K. De; T. Maeno; A Stradling; Torre Wenaus

The Production and Distributed Analysis system (PanDA) [1-2] was designed to meet ATLAS [3] requirements for a data-driven workload management system capable of operating at LHC data processing scale. Submitted jobs are executed on worker nodes by pilot jobs sent to the grid sites by pilot factories. This paper provides an overview of the PanDA pilot [4] system and presents major features added in light of recent operational experience, including multi-job processing, advanced job recovery for jobs with output storage failures, gLExec [5-6] based identity switching from the generic pilot to the actual user, and other security measures. The PanDA system serves all ATLAS distributed processing and is the primary system for distributed analysis; it is currently used at over 100 sites worldwide. We analyze the performance of the pilot system in processing real LHC data on the OSG [7], EGI [8] and NorduGrid [9-10] infrastructures used by ATLAS, and describe plans for its evolution.
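
The pilot model summarized above can be sketched schematically: a pilot lands on a worker node, repeatedly asks the server for a queued job, runs the payload, and reports back, which is also the essence of multi-job processing. This is an illustration only; the endpoint URL, the job dictionary fields, and report_status are invented for the example.

    # Schematic pilot loop (illustrative only; URLs and payload handling are hypothetical).
    import json, subprocess, time, urllib.request

    SERVER = "https://panda.example.org/getjob"   # hypothetical dispatcher endpoint

    def fetch_job():
        """Ask the server for a queued job; return None if nothing is available."""
        with urllib.request.urlopen(SERVER, timeout=60) as resp:
            job = json.load(resp)
        return job or None

    def report_status(job_id, return_code):
        """Stand-in for the real callback that updates the job record on the server."""
        print(f"job {job_id} finished with exit code {return_code}")

    def run_pilot(max_jobs=5, idle_sleep=120):
        """Multi-job processing: keep pulling and running payloads until the quota is used."""
        for _ in range(max_jobs):
            job = fetch_job()
            if job is None:
                time.sleep(idle_sleep)      # nothing queued; wait before asking again
                continue
            rc = subprocess.call(job["command"], shell=True)   # execute the payload
            report_status(job["id"], rc)

    if __name__ == "__main__":
        run_pilot()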


Journal of Physics: Conference Series | 2015

The ATLAS Event Service: A new approach to event processing

P. Calafiura; K. De; Wen Guan; T. Maeno; P. Nilsson; Danila Oleynik; S. Panitkin; V. Tsulaia; P. van Gemmeren; Torre Wenaus

The ATLAS Event Service (ES) implements a new fine-grained approach to HEP event processing, designed to be agile and efficient in exploiting transient, short-lived resources such as HPC hole-filling, spot market commercial clouds, and volunteer computing. Input and output control and data flows, bookkeeping, monitoring, and data storage are all managed at the event level in an implementation capable of supporting ATLAS-scale distributed processing throughputs (about 4M CPU-hours/day). Input data flows utilize remote data repositories with no data locality or pre-staging requirements, minimizing the use of costly storage in favor of strongly leveraging powerful networks. Object stores provide a highly scalable means of remotely storing the quasi-continuous, fine-grained outputs that give ES-based applications a very light data footprint on a processing resource, and ensure negligible losses should the resource suddenly vanish. We will describe the motivations for the ES system, its unique features and capabilities, its architecture and the highly scalable tools and technologies employed in its implementation, and its applications in ATLAS processing on HPCs, commercial cloud resources, volunteer computing, and grid resources. Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher by accepting the manuscript for publication acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
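
A minimal sketch of the event-level pattern described above, assuming hypothetical helpers: small event ranges are fetched, processed one at a time, and each tiny output is shipped to an object store immediately so little is lost if the transient resource vanishes.

    # Illustrative event-level processing loop (not the ATLAS Event Service code itself).
    import io

    def get_event_ranges(batch=10):
        """Hypothetical request to the server for the next batch of event ranges."""
        return [{"range_id": f"r{i}", "first": i * 100, "last": i * 100 + 99}
                for i in range(batch)]

    def process_events(evt_range):
        """Stand-in for the real payload; produces a small per-range output."""
        return io.BytesIO(f"output for {evt_range['range_id']}".encode())

    def upload_to_object_store(range_id, data):
        """Stand-in for an object-store PUT of one fine-grained output."""
        print(f"uploaded {range_id}: {len(data.getvalue())} bytes")

    for evt_range in get_event_ranges():
        out = process_events(evt_range)
        # Stream each small output off the node immediately, so almost nothing
        # is lost if the transient resource disappears mid-job.
        upload_to_object_store(evt_range["range_id"], out)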


Journal of Physics: Conference Series | 2015

The future of PanDA in ATLAS distributed computing

K. De; A. Klimentov; T. Maeno; P. Nilsson; Danila Oleynik; S. Panitkin; Artem Petrosyan; J. Schovancova; A. Vaniachine; Torre Wenaus

Experiments at the Large Hadron Collider (LHC) face unprecedented computing challenges. Heterogeneous resources are distributed worldwide at hundreds of sites, thousands of physicists analyse the data remotely, the volume of processed data is beyond the exabyte scale, while data processing requires more than a few billion hours of computing usage per year. The PanDA (Production and Distributed Analysis) system was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. In the process, the old batch job paradigm of locally managed computing in HEP was discarded in favour of a far more automated, flexible and scalable model. The success of PanDA in ATLAS is leading to widespread adoption and testing by other experiments. PanDA is the first exascale workload management system in HEP, already operating at more than a million computing jobs per day, and processing over an exabyte of data in 2013. There are many new challenges that PanDA will face in the near future, in addition to new challenges of scale, heterogeneity and increasing user base. PanDA will need to handle rapidly changing computing infrastructure, will require factorization of code for easier deployment, will need to incorporate additional information sources including network metrics in decision making, be able to control network circuits, handle dynamically sized workload processing, provide improved visualization, and face many other challenges. In this talk we will focus on the new features, planned or recently implemented, that are relevant to the next decade of distributed computing workload management using PanDA.


Journal of Physics: Conference Series | 2012

Evolution of the ATLAS PanDA Production and Distributed Analysis System

T. Maeno; K. De; Torre Wenaus; P. Nilsson; R Walker; A Stradling; V Fine; M Potekhin; S. Panitkin; G Compostella

The PanDA (Production and Distributed Analysis) system has been developed to meet ATLAS production and analysis requirements for a data-driven workload management system capable of operating at LHC data processing scale. PanDA has performed well with high reliability and robustness during the two years of LHC data-taking, while being actively evolved to meet the rapidly changing requirements for analysis use cases. We will present an overview of system evolution including automatic rebrokerage and reattempt for analysis jobs, adaptation for the CernVM File System, support for the multi-cloud model through which Tier-2 sites act as members of multiple clouds, pledged resource management and preferential brokerage, and monitoring improvements. We will also describe results from the analysis of two years of PanDA usage statistics, current issues, and plans for the future.


Journal of Physics: Conference Series | 2011

The ATLAS PanDA Monitoring System and its Evolution

Alexei Klimentov; P. Nevski; M Potekhin; Torre Wenaus

The PanDA (Production and Distributed Analysis) Workload Management System is used for ATLAS distributed production and analysis worldwide. The needs of ATLAS global computing imposed challenging requirements on the design of PanDA in areas such as scalability, robustness, automation, diagnostics, and usability for both production shifters and analysis users. Through a system-wide job database, the PanDA monitor provides a comprehensive and coherent view of the system and job execution, from high level summaries to detailed drill-down job diagnostics. It is (like the rest of PanDA) an Apache-based Python application backed by Oracle. The presentation layer is HTML code generated on the fly in the Python application, which is also responsible for managing database queries. However, this approach is lacking in user interface flexibility, simplicity of communication with external systems, and ease of maintenance. A decision was therefore made to migrate the PanDA monitor server to the Django Web Application Framework and to apply JSON/AJAX technology in the browser front end. This allows us to greatly reduce the amount of application code, separate data preparation from presentation, leverage open source tools for functions such as authentication and authorization, and provide a richer and more dynamic user experience. We describe our approach, design and initial experience with the migration process.
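
A minimal sketch of the kind of change described, assuming a hypothetical Job model and field names: a Django view returns JSON instead of generating HTML in Python, and a browser-side AJAX handler does the rendering, separating data preparation from presentation.

    # Sketch of a JSON-returning Django view; the model and fields are hypothetical.
    from django.http import JsonResponse

    def job_summary(request):
        """Return recent job records as JSON; a browser-side AJAX handler renders them."""
        from monitor.models import Job                        # hypothetical ORM model
        rows = (Job.objects
                   .order_by("-modificationtime")             # hypothetical field names
                   .values("pandaid", "jobstatus", "computingsite")[:100])
        return JsonResponse({"jobs": list(rows)})

A URL pattern would route, say, /jobs/summary/ to such a view; the layout work then moves into JavaScript in the browser, leaving the server-side code much smaller.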


Presented at International Conference on Computing in High Energy and Nuclear Physics (CHEP 07), Victoria, BC, Canada, 2-7 Sep 2007 | 2008

The Open Science Grid status and architecture

R. Pordes; D. Petravick; Bill Kramer; Doug Olson; Miron Livny; Alain Roy; P. Avery; K. Blackburn; Torre Wenaus; F. Würthwein; Ian T. Foster; Robert Gardner; Michael Wilde; Alan Blatecky; John McGee; Rob Quick

The Open Science Grid (OSG) provides a distributed facility where the Consortium members provide guaranteed and opportunistic access to shared computing and storage resources. The OSG project [1] is funded by the National Science Foundation and the Department of Energy Scientific Discovery through Advanced Computing program. The OSG project provides specific activities for the operation and evolution of the common infrastructure. The US ATLAS and US CMS collaborations contribute to and depend on OSG as the US infrastructure contributing to the Worldwide LHC Computing Grid, on which the LHC experiments distribute and analyze their data. Other stakeholders include the STAR RHIC experiment, the Laser Interferometer Gravitational-Wave Observatory (LIGO), the Dark Energy Survey (DES), and several Fermilab Tevatron experiments (CDF, D0, MiniBooNE, etc.). The OSG implementation architecture brings a pragmatic approach to enabling vertically integrated, community-specific distributed systems over a common horizontal set of shared resources and services. More information can be found at the OSG web site: www.opensciencegrid.org.


21st International Conference on Computing in High Energy and Nuclear Physics, CHEP 2015 | 2015

Integration of PanDA workload management system with Titan supercomputer at OLCF

K. De; A. Klimentov; Danila Oleynik; S. Panitkin; A. Petrosyan; J. Schovancova; A. Vaniachine; Torre Wenaus

The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently distributes jobs to more than 100,000 cores at well over 100 Grid sites, the future LHC data taking runs will require more resources than Grid computing can possibly provide. To alleviate these challenges, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. We will describe a project aimed at integration of the PanDA WMS with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach utilizes a modified PanDA pilot framework for job submission to Titan's batch queues and local data management, with light-weight MPI wrappers to run single-threaded workloads in parallel on Titan's multicore worker nodes. It also gives PanDA the new capability to collect, in real time, information about unused worker nodes on Titan, which allows precise definition of the size and duration of jobs submitted to Titan according to available free resources. This capability significantly reduces PanDA job wait time while improving Titan's utilization efficiency. This implementation was tested with a variety of Monte Carlo workloads on Titan and is being tested on several other supercomputing platforms. Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher by accepting the manuscript for publication acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
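
The backfill-aware sizing idea can be illustrated with a short sketch; query_backfill is a stand-in for whatever the batch system actually reports (for example a Moab showbf-style query), and the numbers are invented.

    # Illustrative backfill-aware job sizing (the batch-system query is an assumption).
    def query_backfill():
        """Return (free_nodes, seconds_available) reported by the batch system.
        Hard-coded here as a stand-in for parsing the real scheduler's backfill query."""
        return 300, 2 * 3600     # e.g. 300 idle nodes free for the next two hours

    def size_job(cores_per_node=16, safety=0.9):
        """Shape the submission to fit inside the currently unused resources."""
        nodes, window = query_backfill()
        nodes_to_use = max(1, int(nodes * safety))   # leave some headroom
        walltime = int(window * safety)              # finish inside the backfill window
        return nodes_to_use, cores_per_node * nodes_to_use, walltime

    nodes, ranks, walltime = size_job()
    print(f"submit MPI job: {nodes} nodes, {ranks} ranks, walltime {walltime} s")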


21st International Conference on Computing in High Energy and Nuclear Physics, CHEP 2015 | 2015

Fine grained event processing on HPCs with the ATLAS Yoda system

P. Calafiura; K. De; Wen Guan; T. Maeno; P. Nilsson; Danila Oleynik; S. Panitkin; V. Tsulaia; Peter van Gemmeren; Torre Wenaus

High performance computing facilities present unique challenges and opportunities for HEP event processing. The massive scale of many HPC systems means that fractionally small utilization can yield large returns in processing throughput. Parallel applications which can dynamically and efficiently fill any scheduling opportunities the resource presents benefit both the facility (maximal utilization) and the (compute-limited) science. The ATLAS Yoda system provides this capability to HEP-like event processing applications by implementing event-level processing in an MPI-based master-client model that integrates seamlessly with the more broadly scoped ATLAS Event Service. Fine-grained, event-level work assignments are intelligently dispatched to parallel workers to sustain full utilization on all cores, with outputs streamed off to destination object stores in near real time with similarly fine granularity, such that processing can proceed until termination with full utilization. The system offers the efficiency and scheduling flexibility of preemption without requiring that the application actually support or employ checkpointing. We will present the new Yoda system, its motivations, architecture, implementation, and applications in ATLAS data processing at several US HPC centers.
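
A minimal mpi4py sketch of an MPI master-client dispatch loop of the kind described, not the Yoda implementation itself: rank 0 hands out event-range indices on request and the other ranks process them; the payload and output streaming are omitted.

    # Minimal MPI master-client event dispatch in the spirit of Yoda (illustrative only).
    # Run with e.g.: mpirun -n 4 python yoda_sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    NRANGES = 20                                    # placeholder number of event ranges

    if rank == 0:                                   # master: hands out event ranges on request
        next_range, finished_clients = 0, 0
        while finished_clients < size - 1:
            status = MPI.Status()
            comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
            if next_range < NRANGES:
                comm.send(next_range, dest=status.Get_source())   # assign the next range
                next_range += 1
            else:
                comm.send(None, dest=status.Get_source())         # no work left
                finished_clients += 1
    else:                                           # clients: request, process, repeat
        while True:
            comm.send(rank, dest=0)                 # ask the master for work
            evt_range = comm.recv(source=0)
            if evt_range is None:
                break
            # the real payload would process the event range here and stream
            # its small output to an object store, as in the Event Service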

Collaboration


Dive into Torre Wenaus's collaborations.

Top Co-Authors

T. Maeno | Brookhaven National Laboratory
K. De | University of Texas at Arlington
P. Nilsson | Brookhaven National Laboratory
S. Panitkin | Brookhaven National Laboratory
Alexei Klimentov | Brookhaven National Laboratory
Danila Oleynik | University of Texas at Arlington
Siarhei Padolski | Brookhaven National Laboratory
A. Klimentov | Brookhaven National Laboratory
J Caballero | Brookhaven National Laboratory
John Hover | Brookhaven National Laboratory