Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Oliver Gutsche is active.

Publication


Featured research published by Oliver Gutsche.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2007

The CMS remote analysis builder (CRAB)

D. Spiga; Stefano Lacaprara; W. Bacchi; Mattia Cinquilli; G. Codispoti; Marco Corvo; A. Dorigo; A. Fanfani; Federica Fanzago; F. M. Farina; M. Merlo; Oliver Gutsche; L. Servoli; C. Kavka

The CMS experiment will produce several petabytes of data every year, to be distributed over many computing centres located in different countries. Analysis of this data will also be performed in a distributed way, using grid infrastructure. CRAB (CMS Remote Analysis Builder) is a specific tool, designed and developed by the CMS collaboration, that gives end physicists transparent access to distributed data. Very little knowledge of the underlying technicalities is required of the user. CRAB interacts with the local user environment, the CMS Data Management services and the Grid middleware, and is able to use WLCG, gLite and OSG middleware. CRAB has been in production and in routine use by end-users since Spring 2004. It has been extensively used in studies to prepare the Physics Technical Design Report (PTDR) and in the analysis of reconstructed event samples generated during the Computing Software and Analysis Challenge (CSA06). This involved generating thousands of jobs per day at peak rates. In this paper we discuss the current implementation of CRAB, the experience with using it in production and the plans to improve it in the immediate future.
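
To make the grid transparent, a tool like CRAB has to turn a single user request ("run this configuration over this dataset") into many independent grid jobs. The sketch below illustrates the kind of job-splitting arithmetic involved, in the spirit of CRAB's event-based splitting parameters; the function and parameter names are hypothetical, not CRAB's actual interface.

    # Hypothetical illustration of event-based job splitting, in the spirit
    # of user parameters like total_number_of_events / events_per_job.
    def split_into_jobs(total_events, events_per_job):
        """Return a list of (first_event, num_events) ranges, one per grid job."""
        jobs = []
        first = 0
        while first < total_events:
            n = min(events_per_job, total_events - first)
            jobs.append((first, n))
            first += n
        return jobs

    # A user asking for 25000 events in chunks of 10000 gets three jobs:
    # [(0, 10000), (10000, 10000), (20000, 5000)]
    print(split_into_jobs(25000, 10000))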


Journal of Physics: Conference Series | 2012

A new era for central processing and production in CMS

E Fajardo; Oliver Gutsche; S Foulkes; J Linacre; V Spinoso; A Lahiff; G. Gomez-Ceballos; M. Klute; Ajit Mohapatra

The goal for CMS computing is to maximise the throughput of simulated event generation while also processing event data generated by the detector as quickly and reliably as possible. To maintain this as the quantity of events increases, CMS computing has migrated at the Tier 1 level from its old production framework, ProdAgent, to a new one, WMAgent. The WMAgent framework offers improved processing efficiency and increased resource utilisation, as well as a reduction in operational manpower. In addition to the challenges encountered during the design of the WMAgent framework, several operational issues have arisen during its commissioning. The largest operational challenges were in the usage and monitoring of resources, mainly a result of a change in the way work is allocated. Instead of work being assigned to operators, all work is centrally injected and managed in the Request Manager system, and the task of the operators has changed from running individual workflows to monitoring the global workload. In this report we present how we tackled some of the operational challenges, and how we benefitted from the lessons learned in the commissioning of the WMAgent framework at the Tier 2 level in late 2011. As case studies, we will show how the WMAgent system performed during some of the large data reprocessing and Monte Carlo simulation campaigns.
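
The organisational shift described above, from operator-owned workflows to centrally injected requests, can be pictured as a single shared queue that agents pull work from. The sketch below is a schematic of that model only; the function and field names are hypothetical and do not reflect WMAgent's actual interfaces.

    from collections import deque

    # Hypothetical schematic of central work injection: requests enter one
    # shared queue, and agents pull work instead of operators assigning
    # workflows by hand.
    request_queue = deque()

    def inject_request(name, input_dataset, n_events):
        """Centrally register a production request (schematic only)."""
        request_queue.append({
            "name": name,
            "input_dataset": input_dataset,
            "events": n_events,
            "status": "queued",
        })

    def agent_pull():
        """An agent acquires the next piece of work from the global queue."""
        if request_queue:
            request = request_queue.popleft()
            request["status"] = "acquired"
            return request
        return None

    inject_request("Summer11-MC-1", "/SomeDataset/SomeEra/AOD", 5_000_000)
    print(agent_pull())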


Journal of Physics: Conference Series | 2014

CMS computing operations during Run 1

J Adelman; S. Alderweireldt; J Artieda; G. Bagliesi; D Ballesteros; S. Bansal; L. A. T. Bauerdick; W Behrenhof; S. Belforte; K. Bloom; B. Blumenfeld; S. Blyweert; D. Bonacorsi; C. Brew; L Contreras; A Cristofori; S Cury; D da Silva Gomes; M Dolores Saiz Santos; J Dost; David Dykstra; E Fajardo Hernandez; F Fanzago; I. Fisk; J Flix; A Georges; M. Giffels; G. Gomez-Ceballos; S. J. Gowdy; Oliver Gutsche

During the first run, CMS collected and processed more than 10B data events and simulated more than 15B events. Up to 100k processor cores were used simultaneously and 100 PB of storage was managed. Each month petabytes of data were moved and hundreds of users accessed data samples. In this document we discuss the operational experience from this first run. We present the workflows and data flows that were executed, and we discuss the tools and services developed, and the operations and shift models used to sustain the system. Many techniques followed the original computing plans, but some were reactions to difficulties and opportunities. We also address the lessons learned from an operational perspective, and how this is shaping our thoughts for 2015.


Journal of Physics: Conference Series | 2015

Using the glideinWMS System as a Common Resource Provisioning Layer in CMS

J Balcas; S Belforte; B Bockelman; D Colling; Oliver Gutsche; Dirk Hufnagel; F Khan; K Larson; J Letts; M Mascheroni; D Mason; A McCrea; S Piperov; M Saiz-Santos; I Sfiligoi; A Tanasijczuk; C Wissing

CMS will require access to more than 125k processor cores for the beginning of Run 2 in 2015 to carry out its ambitious physics program with more and higher complexity events. During Run 1 these resources were predominantly provided by a mix of grid sites and local batch resources. During the long shutdown, cloud infrastructures, diverse opportunistic resources and HPC supercomputing centers were made available to CMS, which further complicated the operations of the submission infrastructure. In this presentation we will discuss the CMS effort to adopt and deploy the glideinWMS system as a common resource provisioning layer for grid, cloud, local batch, and opportunistic resources and sites. We will address the challenges associated with integrating the various types of resources, the efficiency gains and simplifications associated with using a common resource provisioning layer, and discuss the solutions found. We will finish with an outlook on future plans for how CMS is moving forward on resource provisioning for more heterogeneous architectures and services.
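
The core idea of a pilot-based provisioning layer is that user jobs never target a specific resource type: a frontend watches the queue of idle jobs, a factory submits pilots to whichever grid, cloud, batch, or opportunistic entry points can absorb them, and each pilot then joins one common pool. The following sketch is a schematic of that division of labour, with hypothetical names; it is not glideinWMS code.

    # Schematic of the glideinWMS frontend/factory idea (hypothetical names):
    # the frontend counts demand, the factory turns demand into pilot
    # submissions against heterogeneous entry points.
    ENTRY_POINTS = {
        "grid_site_A": {"type": "grid", "max_pilots": 500},
        "cloud_B": {"type": "cloud", "max_pilots": 200},
        "hpc_C": {"type": "hpc", "max_pilots": 100},
    }

    def frontend_demand(idle_jobs, running_pilots):
        """How many additional pilots the pool needs right now."""
        return max(0, idle_jobs - running_pilots)

    def factory_provision(demand):
        """Spread pilot submissions over entry points, respecting their limits."""
        submissions = {}
        for name, entry in ENTRY_POINTS.items():
            if demand <= 0:
                break
            n = min(demand, entry["max_pilots"])
            submissions[name] = n  # in reality: submit n pilot jobs here
            demand -= n
        return submissions

    # 600 idle jobs and 150 pilots already running -> request 450 more pilots.
    print(factory_provision(frontend_demand(600, 150)))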


Journal of Physics: Conference Series | 2014

Evolution of the pilot infrastructure of CMS: towards a single glideinWMS pool

S. Belforte; Oliver Gutsche; J. Letts; K Majewski; A McCrea; I. Sfiligoi

CMS production and analysis job submission is based largely on glideinWMS and pilot submissions. The transition from multiple different submission solutions, such as gLite WMS and HTCondor-based implementations, was carried out over several years and is now coming to a conclusion. The historically separate glideinWMS pools for different types of production jobs and analysis jobs are being unified into a single global pool. This enables CMS to benefit from global prioritization and scheduling possibilities. It also presents the sites with only one kind of pilot and eliminates the need to make scheduling decisions at the CE level. This paper provides an analysis of the benefits of a unified resource pool, as well as a description of the resulting global policy. It will explain the technical challenges moving forward and present solutions to some of them.
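
The practical benefit of a single pool is that one scheduler can order all work, production and analysis alike, by a single policy, instead of two pools each ordering only their own jobs. A toy illustration of global prioritization, with made-up job priorities:

    # Toy illustration: with separate pools, each pool can only order its own
    # jobs; with one global pool, a single sort expresses the whole policy.
    jobs = [
        {"id": 1, "kind": "production", "priority": 80},
        {"id": 2, "kind": "analysis",   "priority": 95},
        {"id": 3, "kind": "production", "priority": 60},
        {"id": 4, "kind": "analysis",   "priority": 70},
    ]

    # Global pool: one ordering across both kinds of work.
    global_order = sorted(jobs, key=lambda j: j["priority"], reverse=True)
    print([j["id"] for j in global_order])  # [2, 1, 4, 3]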


Journal of Physics: Conference Series | 2012

CMS Data Transfer operations after the first years of LHC collisions

R Kaselis; S Piperov; N Magini; J Flix; Oliver Gutsche; P Kreuzer; M Yang; S Liu; N Ratnikova; A Sartirana; D Bonacorsi; J. Letts

The CMS experiment utilizes a distributed computing infrastructure, and its performance depends heavily on the fast and smooth distribution of data between different CMS sites. Data must be transferred from the Tier-0 (CERN) to the Tier-1s for processing, storing and archiving, and timeliness and good transfer quality are vital to avoid overflowing CERN storage buffers. At the same time, processed data have to be distributed from Tier-1 sites to all Tier-2 sites for physics analysis, while Monte Carlo simulations are sent back to Tier-1 sites for further archival. At the core of the transfer machinery is the PhEDEx (Physics Experiment Data Export) data transfer system. It is very important to ensure reliable operation of the system, and the operational tasks comprise monitoring and debugging all transfer issues. Based on transfer quality information, the Site Readiness tool is used to create plans for future resource utilization. We review the operational procedures created to enforce reliable data delivery to CMS distributed sites all over the world. Additionally, we need to keep data and metadata consistent at all sites, both on disk and on tape. In this presentation, we describe the principles and actions taken to keep data consistent between site storage systems and the central CMS data replication database (TMDB/DBS), while ensuring fast and reliable delivery of data samples of hundreds of terabytes to the entire CMS physics community.
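
Keeping site storage and the central catalogue consistent is conceptually a set comparison: files known to the catalogue but absent from the site are missing, and files on the site but unknown to the catalogue are orphans wasting space. A minimal sketch of such a check follows; the function names and inputs are hypothetical, not the actual CMS consistency tooling.

    # Hypothetical consistency check between a central catalogue listing and
    # a site storage dump, reduced to two set differences.
    def check_consistency(catalogue_files, site_files):
        catalogue = set(catalogue_files)
        on_site = set(site_files)
        return {
            "missing": sorted(catalogue - on_site),  # registered but not on disk/tape
            "orphans": sorted(on_site - catalogue),  # on disk/tape but not registered
        }

    catalogue = ["/store/data/a.root", "/store/data/b.root"]
    site_dump = ["/store/data/b.root", "/store/data/stale.root"]
    print(check_consistency(catalogue, site_dump))
    # {'missing': ['/store/data/a.root'], 'orphans': ['/store/data/stale.root']}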


arXiv: Distributed, Parallel, and Cluster Computing | 2017

Big Data in HEP: A comprehensive use case study

Oliver Gutsche; Jim Pivarski; Jim Kowalkowski; Nhan Tran; A. Svyatkovskiy; Matteo Cremonesi; P. Elmer; Bo Jayatilaka; Saba Sehrish; Cristina Mantilla Suarez

Experimental particle physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems, collectively called Big Data technologies, have emerged to support the analysis of petabyte and exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches, promise a fresh look at the analysis of very large datasets, and could potentially reduce the time-to-physics with increased interactivity. In this talk, we present an active LHC Run 2 analysis, searching for dark matter with the CMS detector, as a testbed for Big Data technologies. We directly compare the traditional NTuple-based analysis with an equivalent analysis using Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the analysis with the official experiment data formats and produce publication physics plots. We will discuss the advantages and disadvantages of each approach and give an outlook on further studies needed.
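
For a sense of what the Spark side of such a comparison looks like, here is a minimal, hypothetical PySpark sketch: a flat table of events with a missing-transverse-energy column is filtered and histogrammed. The file path and column name are placeholders, and this is not the analysis code from the paper.

    from pyspark.sql import SparkSession

    # Minimal, hypothetical sketch of an NTuple-style selection in Spark.
    spark = SparkSession.builder.appName("met-sketch").getOrCreate()

    # Assume events have been converted to a columnar format with a 'met'
    # (missing transverse energy) column; the path is a placeholder.
    events = spark.read.parquet("hdfs:///user/analysis/events.parquet")

    # Event selection: the filter runs in parallel across the cluster.
    selected = events.filter(events["met"] > 200.0)

    # Histogram the selected events; rdd.histogram returns (bin_edges, counts).
    bins, counts = selected.select("met").rdd.map(lambda row: row[0]).histogram(20)
    print(bins, counts)

    spark.stop()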


Computing and Software for Big Science | 2017

HEPCloud, a New Paradigm for HEP Facilities: CMS Amazon Web Services Investigation

Burt Holzman; M. Girone; Dirk Hufnagel; Dave Dykstra; Hyunwoo Kim; Steve Timm; Oliver Gutsche; S. Fuess; L. A. T. Bauerdick; Anthony Tiradani; Panagiotis Spentzouris; N. Magini; Brian Bockelman; Eric Wayne Vaandering; Robert Kennedy; D. Mason; G. Garzoglio; I. Fisk

Historically, high energy physics computing has been performed on large purpose-built computing systems. These began as single-site compute facilities, but have evolved into the distributed computing grids used today. Recently, there has been an exponential increase in the capacity and capability of commercial clouds. Cloud resources are highly virtualized and intended to be flexibly deployed for a variety of computing tasks. There is growing interest among cloud providers in demonstrating the capability to perform large-scale scientific computing. In this paper, we discuss results from the CMS experiment using the Fermilab HEPCloud facility, which utilized both local Fermilab resources and virtual machines in the Amazon Web Services Elastic Compute Cloud. We discuss the planning, technical challenges, and lessons learned involved in performing physics workflows on a large-scale set of virtualized resources. In addition, we discuss the economics and operational efficiencies when executing workflows both in the cloud and on dedicated resources.
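
As a concrete illustration of what bursting a facility into EC2 involves at the lowest level, the hedged sketch below requests a batch of virtual machines with boto3, the standard AWS SDK for Python. The AMI ID, instance type, and counts are placeholders, and HEPCloud's actual provisioning layer is considerably more elaborate (spot pricing, quotas, and scheduling decisions).

    import boto3

    # Hypothetical illustration: provision worker VMs in EC2. All identifiers
    # below are placeholders, not HEPCloud's real configuration.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder worker-node image
        InstanceType="m4.xlarge",         # placeholder instance type
        MinCount=1,
        MaxCount=10,                      # request up to 10 workers
    )

    for instance in response["Instances"]:
        print("launched", instance["InstanceId"])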


Journal of Physics: Conference Series | 2012

No file left behind - monitoring transfer latencies in PhEDEx

T. Chwalek; R. Egeland; Oliver Gutsche; C-H Huang; R Kaselis; M. Klute; N Magini; F Moscato; S. Piperov; N Ratnikova; P Rossman; A. Sanchez-Hernandez; A. Sartirana; T. Wildish; M. Yang; Si Xie

The CMS experiment has to move petabytes of data among dozens of computing centres with low latency in order to make efficient use of its resources. Transfer operations are well established to achieve the desired level of throughput, but operators lack a system to identify early on the transfers that will need manual intervention to reach completion. File transfer latencies are sensitive to underlying problems in the transfer infrastructure, and their measurement can be used as a prompt trigger for preventive actions. For this reason, PhEDEx, the CMS transfer management system, has recently implemented a monitoring system to measure transfer latencies at the level of individual files. For the first time, the system can predict the completion time for the transfer of a data set. Operators can detect abnormal patterns in transfer latencies early and correct the issues while the transfer is still in progress. Statistics are aggregated for blocks of files into a historical log, which is used to monitor the long-term evolution of transfer latencies, to evaluate the performance of the transfer infrastructure, and to plan the global data placement strategy. In this contribution, we present the typical patterns of transfer latencies that may be identified with the latency monitor, and we show how we are able to detect sources of latency arising from the underlying infrastructure (such as stuck files) which need operator intervention.
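
A file-level latency monitor of this kind boils down to tracking per-file timestamps and flagging files whose transfers have made no progress for too long. The sketch below shows one way to express that check; the threshold and record layout are hypothetical, not PhEDEx's actual schema.

    from datetime import datetime, timedelta

    # Hypothetical per-file transfer records: the timestamp of the last
    # observed progress (e.g. a retry or byte-count change). None means the
    # file has completed.
    STUCK_AFTER = timedelta(hours=12)

    def find_stuck_files(records, now):
        """Flag files with no progress for longer than STUCK_AFTER."""
        return [
            name
            for name, last_progress in records.items()
            if last_progress is not None and now - last_progress > STUCK_AFTER
        ]

    now = datetime(2012, 3, 1, 12, 0)
    records = {
        "/store/data/run1/a.root": datetime(2012, 2, 29, 23, 0),  # stuck
        "/store/data/run1/b.root": datetime(2012, 3, 1, 11, 0),   # progressing
        "/store/data/run1/c.root": None,                          # completed
    }
    print(find_stuck_files(records, now))  # ['/store/data/run1/a.root']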


Journal of Physics: Conference Series | 2010

Validation of software releases for CMS

Oliver Gutsche; Offline Projects

The CMS software stack currently consists of more than 2 million lines of code developed by over 250 authors, with a new version being released every week. CMS has set up a release validation process for quality assurance which enables the developers to compare to previous releases and references. This process provides the developers with reconstructed datasets of real data and MC samples. The samples span the whole range of detector effects and important physics signatures to benchmark the performance of the software. They are used to investigate interdependency effects of software packages and to find and fix bugs. The samples have to be available in a very short time after a release is published to fit into the streamlined CMS development cycle. The standard CMS processing infrastructure and dedicated resources at CERN and FNAL are used to achieve a very short turnaround of 24 hours. The release validation process described here is an integral part of CMS software development and contributes significantly to ensuring stable production and analysis. Its success emphasizes the importance of a streamlined release validation process for projects with a large code base and a significant number of developers, and it can serve as an example for future projects.
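
Comparing a new release against a reference ultimately means comparing distributions of the same quantity produced by the two releases. Below is a hedged sketch of such a check, using a two-sample Kolmogorov-Smirnov test from SciPy as one possible statistic; the actual CMS validation criteria are the authors' own, not this code.

    import numpy as np
    from scipy.stats import ks_2samp

    # Hypothetical validation check: compare the same quantity (e.g. a
    # reconstructed mass) between a reference release and a new release.
    def compatible(reference_values, new_values, p_threshold=0.01):
        """Return True if the two distributions are statistically compatible."""
        statistic, p_value = ks_2samp(reference_values, new_values)
        return p_value > p_threshold

    rng = np.random.default_rng(seed=42)
    reference = rng.normal(loc=91.2, scale=2.5, size=10_000)  # reference release
    candidate = rng.normal(loc=91.2, scale=2.5, size=10_000)  # new release

    print("release compatible with reference:", compatible(reference, candidate))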

Collaboration


Dive into Oliver Gutsche's collaboration.

Top Co-Authors

P. Elmer (Princeton University)
Brian Bockelman (University of Nebraska–Lincoln)
J. Letts (University of California)
A McCrea (University of California)